Hi Abhinav,

Have you tried reading the 5-minute Parser guide on the website (http://tika.apache.org/1.7/parser_guide.html)? That should give you an idea of how to create a new Parser.

Tika is split into multiple components, each responsible for a different feature of Tika.

tika-core contains the main interfaces of Tika: Parser <https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/parser/Parser.java>, Detector <https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/detect/Detector.java>, and Translator <https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/language/translate/Translator.java>. If a project uses only tika-core as a dependency, it won't have access to all of the existing parser libraries Tika has implemented. In order to parse a wide variety of file types, you have to include tika-parsers. As the name implies, this contains Tika's... Parsers! Similarly, there are a few implementations of the Translator interface in the tika-translate component.

A Parser's job is to implement the parse <https://github.com/apache/tika/blob/trunk/tika-core/src/main/java/org/apache/tika/parser/Parser.java#L45> method. The method consumes an input stream, extracts content to feed into the ContentHandler, and extracts metadata to put into the Metadata object. Some configuration can also be passed through the ParseContext argument.

You register new Parsers in this file <https://github.com/apache/tika/blob/trunk/tika-parsers/src/main/resources/META-INF/services/org.apache.tika.parser.Parser>. This file is dynamically loaded to decide which Parser to use for a given file.

tika-example contains some samples of how to interact with and call Tika.

tika-server contains a standalone server you can start and make web requests to. See http://162.209.99.130:8080/ for a VM running tika-server (which may not actually be running right now).

You can also browse around the Tika Wiki <http://wiki.apache.org/tika/>. Here <https://github.com/apache/tika/commit/ea33dd3e33ff3051d9637622f6fbdb4d2f8c4859> is a commit which added a new parser for GRIB formats.

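To make the registration idea above a bit more concrete, here is a toy, self-contained sketch of the general pattern: a services-style file lists parser class names, the classes are loaded by name, and the media type decides which one handles the input. This is not Tika's code. MiniParser, TextParser, CsvParser, MiniRegistry and RegistryDemo are all hypothetical names I made up for illustration, and the real Parser interface works with an InputStream, ContentHandler, Metadata and ParseContext rather than plain strings.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a parser contract; NOT Tika's Parser interface.
interface MiniParser {
    List<String> supportedTypes();
    String parse(String input);
}

// Toy parser: returns the input with surrounding whitespace removed.
class TextParser implements MiniParser {
    public List<String> supportedTypes() { return Collections.singletonList("text/plain"); }
    public String parse(String input) { return input.trim(); }
}

// Toy parser: replaces commas with spaces.
class CsvParser implements MiniParser {
    public List<String> supportedTypes() { return Collections.singletonList("text/csv"); }
    public String parse(String input) { return input.trim().replace(",", " "); }
}

// Hypothetical registry: maps each media type to the parser that claims it.
class MiniRegistry {
    private final Map<String, MiniParser> byType = new LinkedHashMap<>();

    // servicesFile is the text of a services-style file:
    // one class name per line, '#' starts a comment line.
    static MiniRegistry load(String servicesFile) throws ReflectiveOperationException {
        MiniRegistry registry = new MiniRegistry();
        for (String line : servicesFile.split("\n")) {
            String name = line.trim();
            if (name.isEmpty() || name.startsWith("#")) {
                continue;
            }
            // Instantiate the listed class by name via its no-argument constructor.
            MiniParser parser =
                (MiniParser) Class.forName(name).getDeclaredConstructor().newInstance();
            for (String type : parser.supportedTypes()) {
                registry.byType.put(type, parser);
            }
        }
        return registry;
    }

    // Pick the parser registered for the media type and delegate to it.
    String parse(String mediaType, String input) {
        MiniParser parser = byType.get(mediaType);
        if (parser == null) {
            throw new IllegalArgumentException("No parser registered for " + mediaType);
        }
        return parser.parse(input);
    }
}

class RegistryDemo {
    // Simulates the contents of the registration file as a string.
    static String servicesFile() {
        return "# registered parsers\n"
            + TextParser.class.getName() + "\n"
            + CsvParser.class.getName() + "\n";
    }

    public static void main(String[] args) throws Exception {
        MiniRegistry registry = MiniRegistry.load(servicesFile());
        System.out.println(registry.parse("text/csv", "a,b,c\n")); // prints "a b c"
    }
}
```

The point of the pattern is that adding a parser means writing one class and adding one line to the registration file; the dispatching code does not change. How Tika itself loads and selects parsers is more involved, so treat this only as a mental model.
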
Chris Mattmann and Jukka Zitting wrote a book about Tika's design: Tika in Action <http://www.manning.com/mattmann/>. Check out the contributors <http://tika.apache.org/contribute.html> page for some good information and links.

Others, feel free to correct/elaborate on anything. Do we have any publicly available overall design documentation?

I hope that helps,
Tyler

On Tue, Feb 17, 2015 at 2:30 PM, Abhinav Gupta <[email protected]> wrote:

> Hi,
>
> I'm interested in taking up the bug Tika-1456 but I am not sure how to
> tackle it. Before I can come up with a solution of implementing it, I need
> to understand how different parsers are integrated with tika. Is there any
> resource that I can read in order to understand that ?
>
> I am also not sure of how to make sense of the large source code. How can I
> get a better idea of it ?
>
> Thank you very much for your time :)
>
> Regards,
> Abhinav
>
