Hi Tyler,

Can you tell me more about the tika-mimetypes.xml file? Is this a new 'required' file?

I'm not 100% sure about this yet, but here is what it looks like to me. MimeTypeUtils.java instantiates Tika with the default constructor and never explicitly tells Tika which mime-types file to use, even though the correct mime-types.xml file is passed to the MimeTypeUtils constructor from MimeExtractorRepo. So there seems to be no place where the contents of my mime-types.xml file are read and stored in Tika's MimeTypeRegistry, and by default Tika only knows about XML files, text files, and application/octet-stream files.
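For illustration only, the lookup that the custom mime-types.xml is expected to drive can be sketched with the JDK alone. This is not Tika's or OODT's implementation. The class name GlobMimeLookup, the <mime-info> root element, the restriction to simple "*.ext" patterns, and the application/octet-stream fallback are all assumptions made for the sketch. The three glob entries are the ones quoted later in this thread.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of OODT or Tika: maps file names to the
// mime types declared by <glob pattern="*.ext"/> entries.
class GlobMimeLookup {

    // The project-specific entries from this thread, wrapped in an
    // assumed <mime-info> root element.
    static final String SAMPLE =
        "<mime-info>"
        + "<mime-type type=\"product/fei-out\"><glob pattern=\"*.out\"/></mime-type>"
        + "<mime-type type=\"product/fei-ecsv\"><glob pattern=\"*.ecsv\"/></mime-type>"
        + "<mime-type type=\"product/fei-sfdu\"><glob pattern=\"*.sfdu\"/></mime-type>"
        + "</mime-info>";

    // Parse the XML and collect glob pattern -> mime type, in file order.
    static Map<String, String> loadGlobs(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
            Map<String, String> globs = new LinkedHashMap<>();
            NodeList types = doc.getElementsByTagName("mime-type");
            for (int i = 0; i < types.getLength(); i++) {
                Element type = (Element) types.item(i);
                NodeList patterns = type.getElementsByTagName("glob");
                for (int j = 0; j < patterns.getLength(); j++) {
                    Element glob = (Element) patterns.item(j);
                    globs.put(glob.getAttribute("pattern"), type.getAttribute("type"));
                }
            }
            return globs;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse mime-types XML", e);
        }
    }

    // Return the type of the first "*.ext" pattern whose extension matches;
    // fall back to a generic binary type when nothing matches (assumption).
    static String detect(Map<String, String> globs, String fileName) {
        for (Map.Entry<String, String> entry : globs.entrySet()) {
            String pattern = entry.getKey();
            if (pattern.startsWith("*.") && fileName.endsWith(pattern.substring(1))) {
                return entry.getValue();
            }
        }
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        Map<String, String> globs = loadGlobs(SAMPLE);
        for (String name : new String[] {"data.out", "table.ecsv", "frame.sfdu", "notes.txt"}) {
            System.out.println(name + " -> " + detect(globs, name));
        }
    }
}
```

If the custom file were actually being consulted, the three project extensions would resolve to their product/fei-* types, as they do in this sketch, and only unknown names such as notes.txt would fall through to the default. Getting application/octet-stream or text/plain for *.out, *.sfdu, and *.ecsv is consistent with the custom entries never being loaded.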
I will keep looking at this tomorrow and verify which file is passed to Tika's MimeTypesFactory class, but I have to head home now.

Val

Valerie A. Mallder
New Horizons Deputy Mission System Engineer
Johns Hopkins University/Applied Physics Laboratory

-----Original Message-----
From: Mallder, Valerie
Sent: Thursday, January 22, 2015 11:42 AM
To: dev
Subject: RE: Tyler - I may need your help

Hi Tyler,

I have defined a few custom mime types in my filemgr/etc/mime-types.xml file. The contents of my file look exactly like the contents of http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml with the addition of project-specific mime-types.

The tika-mimetypes.xml file you pointed me to has ~2000 more lines than either the http://svn.apache.org/viewvc/oodt/tags/0.8/filemgr/src/main/resources/mime-types.xml file or the http://svn.apache.org/viewvc/oodt/tags/0.8/mvn/archetypes/radix/src/main/resources/archetype-resources/filemgr/src/main/resources/etc/mime-types.xml file. So it is definitely different from the one I've been using. But I copied it over and added my mime types to it, and it didn't help.

The mime types it is returning are 'reasonable' mime-types to return; they are just not the mime-types that I defined for these files. For instance, I have *.sfdu files and *.out files that contain binary data, and Tika says they are "application/octet-stream" files. I also have *.ecsv files that contain text, and Tika says they are "text/plain" files.

Here are the mime-types I defined for these files for my project, and these are the mime-types that I have defined extractors for. None of these filename extensions (*.out, *.ecsv, and *.sfdu) are defined elsewhere in the mime-types.xml file.

    <mime-type type="product/fei-out">
      <glob pattern="*.out"/>
    </mime-type>
    <mime-type type="product/fei-ecsv">
      <glob pattern="*.ecsv"/>
    </mime-type>
    <mime-type type="product/fei-sfdu">
      <glob pattern="*.sfdu"/>
    </mime-type>

I'm a newbie with Java, and I can't guarantee I would be able to build a JUnit test program very easily. But I will continue to investigate and see what I can do.

Thanks!
Val

Valerie A. Mallder
New Horizons Deputy Mission System Engineer
Johns Hopkins University/Applied Physics Laboratory

> -----Original Message-----
> From: Tyler Palsulich [mailto:[email protected]]
> Sent: Wednesday, January 21, 2015 5:13 PM
> To: dev
> Subject: Re: Tyler - I may need your help
>
> Hi Val,
>
> Hmm... Is there a particular (wrong) mime-type that keeps getting
> detected (like text/plain, or something)? I'm curious if the type is
> just returning a default. Or, is it a seemingly random file type?
> What are the contents of your mime-types.xml file? If it's different
> than
> https://raw.githubusercontent.com/apache/tika/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml,
> can you try copying it over?
>
> I'm not sure I'll be able to replicate your error on my computer
> without a bit of difficulty. Do you think there is any way you could
> create a JUnit test case with the problem?
>
> Tyler
>
> On Wed, Jan 21, 2015 at 1:26 PM, Mallder, Valerie
> <[email protected]> wrote:
>
> > Hi Tyler,
> >
> > I have been looking into an issue that cropped up in my OODT
> > system when I upgraded to OODT 0.8. The issue is that my
> > AutoDetectProductCrawler, which is launched from a PGETaskInstance,
> > is unable to determine the mime-type for my product files.
> > I am using the same filemgr/etc/mime-types.xml file that I was
> > using with OODT 0.7, and I am using the same
> > oodt/extensions/policy/mime-extractor-map.xml file that I was using
> > with OODT 0.7, but now, in MimeTypeRepo::getExtractorSpecsForFile,
> > the call to this.mimeRepo.getMimeType(file) is returning the wrong
> > mime-types for all of my files, and so the AutoDetectProductCrawler
> > is telling me I have no extractor specs for my files.
> >
> > I noticed that you did some work on MimeTypeUtils for OODT-630 in
> > OODT 0.8. At first glance, it doesn't look like any of this work
> > would be directly responsible. Can you think of anything that might
> > be causing this to happen? I don't know anything about Tika. Do I
> > need to make any changes to my policy files to remain compatible?
> > Just looking for clues on how to resolve this.
> >
> > I have verified, by adding log messages throughout the code, that
> > prior to launching the AutoDetectProductCrawler, all of the policy
> > files are read correctly. The MimeExtractorConfigReader is reading
> > the correct mime-extractor-map.xml file, it is calling
> > setMimeRepoFile with the correct mime-types.xml file, and it is
> > setting the correct extractor config file, etc. But once
> > AutoDetectProductCrawler starts crawling, it calls
> > getExtractorSpecsForFile, determines the wrong mime type, and then
> > can't find the extractor spec.
> >
> > Thanks,
> > Val
> >
> > Valerie A. Mallder
> > New Horizons Deputy Mission System Engineer
> > The Johns Hopkins University/Applied Physics Laboratory
> > 11100 Johns Hopkins Rd (MS 23-282), Laurel, MD 20723
> > 240-228-7846 (Office) 410-504-2233 (Blackberry)
