[ https://issues.apache.org/jira/browse/NUTCH-766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12832454#action_12832454 ]
Julien Nioche commented on NUTCH-766:
-------------------------------------

@Chris : I just did a fresh checkout from svn, applied patch v3, unzipped sample.tar.gz into the parse-tika directory and ran the test just as you did, but could not reproduce the problem. Could there be a difference between your version and the trunk?

@Sami :
{quote}
was there a reason not to use AutoDetect parser?
{quote}
I suppose we could, as long as we give it a clue about the MimeType obtained from the Content. As you pointed out, there could be a duplication with the detection done by Mime-Util. I suppose one way to do it would be to add a new version of the method: getParse(Content content, MimeType type). That's an interesting point.

{quote}
Also was there a reason not to parse html with tika?
{quote}
It is supposed to do so; if it does not, then it's a bug which needs urgent fixing.

Regarding parsing package formats, I think the plan is that Tika will handle that in the future, but we could try to do that now if we find a relatively clean mechanism for doing so.

BTW could you please send a diff rather than the full code of the class you posted earlier? That would make the comparison much easier.

> Tika parser
> -----------
>
>                 Key: NUTCH-766
>                 URL: https://issues.apache.org/jira/browse/NUTCH-766
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Julien Nioche
>            Assignee: Chris A. Mattmann
>             Fix For: 1.1
>
>         Attachments: NUTCH-766-v3.patch, NUTCH-766.v2, NutchTikaConfig.java,
>                      sample.tar.gz, TikaParser.java
>
>
> Tika handles a lot of different formats under the bonnet and exposes them
> nicely via SAX events. What is described here is a tika-parser plugin which
> delegates the parsing to Tika but can still coexist with the existing
> parsing plugins, which is useful for formats partially handled by Tika (or
> not at all). Some of the elements below have already been discussed on the
> mailing lists. Note that this is work in progress, your feedback is
> welcome.
> Tika is already used by Nutch for its MimeType implementations. Tika comes as
> different jar files (core and parsers); in the work described here we decided
> to put the libs in 2 different places:
>   NUTCH_HOME/lib : tika-core.jar
>   NUTCH_HOME/tika-plugin/lib : tika-parsers.jar
> Since the core uses Tika only for its MimeType functionalities, we only need
> to put tika-core at the main lib level, whereas the tika plugin obviously
> needs tika-parsers.jar plus all the jars used internally by Tika.
>
> Due to limitations in the way Tika loads its classes, we had to duplicate the
> TikaConfig class in the tika-plugin. This might be fixed in the future in
> Tika itself or avoided by refactoring the mimetype part of Nutch using
> extension points.
>
> Unlike most other parsers, Tika handles more than one mime-type, which is why
> we are using "*" as its mimetype value in the plugin descriptor and have
> modified ParserFactory.java so that it considers the tika parser as
> potentially suitable for all mime-types. In practice this means that the
> associations between a mime type and a parser plugin as defined in
> parse-plugins.xml are useful only for the cases where we want to handle a
> mime type with a different parser than Tika.
>
> The general approach I chose was to convert the SAX events returned by the
> Tika parsers into DOM objects and reuse the utilities that come with the
> current HTML parser, i.e. link detection and metatag handling. This also
> means that we can use the HTMLParseFilters in exactly the same way. The main
> difference though is that HTMLParseFilters are not limited to HTML documents
> anymore, as the XHTML tags returned by Tika can correspond to a different
> format for the original document. There is a duplication of code with the
> html-plugin which will be resolved by either a) getting rid of the
> html-plugin altogether or b) exporting its jar and making the tika parser
> depend on it.
>
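[Editor's sketch] The SAX-to-DOM conversion described in the paragraph above can be illustrated with the JDK's own JAXP classes. This is a minimal sketch and not the code from the attached patch: the class name SaxToDomSketch and all of its methods are hypothetical, and the hand-written events in emitSampleEvents merely stand in for the XHTML events a Tika parser would send to a ContentHandler.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.ContentHandler;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical illustration only; none of these names come from the patch.
public class SaxToDomSketch {

    /** Returns a ContentHandler that appends incoming SAX events to the given DOM document. */
    static TransformerHandler domBuilder(Document target) throws Exception {
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler(); // identity transform
        handler.setResult(new DOMResult(target));
        return handler;
    }

    /** Stand-in for a parser: emits a tiny XHTML document as SAX events. */
    static void emitSampleEvents(ContentHandler h) throws Exception {
        AttributesImpl none = new AttributesImpl();
        h.startDocument();
        h.startElement("", "html", "html", none);
        h.startElement("", "body", "body", none);
        AttributesImpl href = new AttributesImpl();
        href.addAttribute("", "href", "href", "CDATA", "http://example.com/");
        h.startElement("", "a", "a", href);
        char[] text = "example".toCharArray();
        h.characters(text, 0, text.length);
        h.endElement("", "a", "a");
        h.endElement("", "body", "body");
        h.endElement("", "html", "html");
        h.endDocument();
    }

    /** Builds a DOM tree from the sample events. */
    static Document buildSampleDom() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
        emitSampleEvents(domBuilder(doc));
        return doc;
    }

    public static void main(String[] args) throws Exception {
        // Once the events are in a DOM tree, DOM-based utilities such as
        // link detection can walk it like any parsed HTML page.
        NodeList links = buildSampleDom().getElementsByTagName("a");
        for (int i = 0; i < links.getLength(); i++) {
            System.out.println(((Element) links.item(i)).getAttribute("href"));
        }
    }
}
```

In the plugin itself, the handler returned by domBuilder would presumably be passed to the Tika parser in place of emitSampleEvents; how the patch actually wires this up is not shown here.
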
> The following libraries are required in the lib/ directory of the
> tika-parser:
>   <library name="asm-3.1.jar"/>
>   <library name="bcmail-jdk15-144.jar"/>
>   <library name="commons-compress-1.0.jar"/>
>   <library name="commons-logging-1.1.1.jar"/>
>   <library name="dom4j-1.6.1.jar"/>
>   <library name="fontbox-0.8.0-incubator.jar"/>
>   <library name="geronimo-stax-api_1.0_spec-1.0.1.jar"/>
>   <library name="hamcrest-core-1.1.jar"/>
>   <library name="jce-jdk13-144.jar"/>
>   <library name="jempbox-0.8.0-incubator.jar"/>
>   <library name="metadata-extractor-2.4.0-beta-1.jar"/>
>   <library name="mockito-core-1.7.jar"/>
>   <library name="objenesis-1.0.jar"/>
>   <library name="ooxml-schemas-1.0.jar"/>
>   <library name="pdfbox-0.8.0-incubating.jar"/>
>   <library name="poi-3.5-FINAL.jar"/>
>   <library name="poi-ooxml-3.5-FINAL.jar"/>
>   <library name="poi-scratchpad-3.5-FINAL.jar"/>
>   <library name="tagsoup-1.2.jar"/>
>   <library name="tika-parsers-0.5-SNAPSHOT.jar"/>
>   <library name="xml-apis-1.0.b2.jar"/>
>   <library name="xmlbeans-2.3.0.jar"/>
>
> There is a small test suite which needs to be improved. We will need to have
> a look at each individual format and check that it is covered by Tika and if
> so to the same extent; the Wiki is probably the right place for this. The
> language identifier (which is a HTMLParseFilter) seemed to work fine.
>
> Again, your comments are welcome. Please bear in mind that this is just a
> first step.
> Julien
> http://www.digitalpebble.com

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.