The "TikaEvalAndStructuralComponents" page has been changed by TimothyAllison:
https://wiki.apache.org/tika/TikaEvalAndStructuralComponents

New page:
'''NOTE: THIS IS A PAGE IN PROGRESS AND SHOULD BE VIEWED AS A VERY ROUGH 
DRAFT''' 

= Evaluating Structural Components in Extracted Content with the tika-eval Module =

'''NOTE:''' This page assumes basic knowledge of the tika-eval workflow.  
Please see TikaEval and make sure that you understand how the tika-eval module 
works on text before considering structure.

File formats often contain structural or stylistic elements, and Apache Tika 
attempts to normalize and represent some of these features in its XHTML output. 
 As of Tika 1.20, users can get counts of common XHTML tags (in Profile mode) 
and/or comparison counts of common XHTML tags (in Compare mode).  Users can 
also count "tag exceptions" -- cases where the structure tags violate XML/XHTML 
requirements, e.g. `<b><i></b></i>`.
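
As a rough illustration of what a tag exception looks like to a strict XML parser, the sketch below feeds a snippet to the JDK's `SAXParser` and reports whether it parsed cleanly.  The class name `TagExceptionCheck` and the method `isWellFormed` are hypothetical and are not part of tika-eval; how tika-eval itself detects and counts tag exceptions may differ.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not a tika-eval class.
class TagExceptionCheck {

    // Returns true if the snippet is well-formed XML, false if the parser
    // raises a SAXException (for example, because tags are mis-nested).
    static boolean isWellFormed(String xhtml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), new DefaultHandler());
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            // Not a structural problem in the input; surface it to the caller.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<p><b><i>text</b></i></p>")); // prints false
        System.out.println(isWellFormed("<p><b><i>text</i></b></p>")); // prints true
    }
}
```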

= Known Limitations =
 * Simply counting structure tags offers only rudimentary insight into the 
structure of a single extract, or into the differences between two extracts of 
the same source file.  One might want to apply a more advanced tree-based 
similarity/distance metric between two extracts -- our JIRA is open and 
committers are standing by.
 * If one tool's extracts have more `<p>` elements than another tool's, 
that doesn't necessarily tell you that one extract is better than the other. 
 For example, one tool (Tool A) might add a `<p>` element for every new line in 
a PDF:
   `<p>The quick brown fox</p>`
   `<p>jumped over the lazy dog</p>`

  Another tool (Tool B) might apply heuristics to reconstruct logical 
paragraphs, such as
   `<p>The quick brown fox jumped over the lazy dog. </p>`

Tool A would have more `<p>` tags, but Tool B is probably capturing better 
information about the structure of the document.
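
The Tool A / Tool B case can be sketched as a simple per-tag difference of counts.  This is only an illustration of the limitation described above: the class `TagCountComparison` and its methods are hypothetical, and this is not how tika-eval's Compare mode is implemented.  The `<body>` wrapper is added here only so that each snippet has a single root element.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not a tika-eval class.
class TagCountComparison {

    // Counts how often each element name opens in a well-formed snippet.
    static Map<String, Integer> countTags(String xhtml) throws Exception {
        Map<String, Integer> counts = new TreeMap<>();
        SAXParserFactory.newInstance().newSAXParser().parse(
                new InputSource(new StringReader(xhtml)),
                new DefaultHandler() {
                    @Override
                    public void startElement(String uri, String localName,
                                             String qName, Attributes atts) {
                        counts.merge(qName, 1, Integer::sum);
                    }
                });
        return counts;
    }

    // For every tag seen in either extract: count in A minus count in B.
    static Map<String, Integer> diff(Map<String, Integer> a, Map<String, Integer> b) {
        Map<String, Integer> d = new TreeMap<>(a);
        b.forEach((tag, n) -> d.merge(tag, -n, Integer::sum));
        return d;
    }

    public static void main(String[] args) throws Exception {
        String toolA = "<body><p>The quick brown fox</p>"
                + "<p>jumped over the lazy dog</p></body>";
        String toolB = "<body><p>The quick brown fox jumped over the lazy dog. </p></body>";
        // Tool A has one more <p> than Tool B, yet Tool B's paragraph is the better one.
        System.out.println(diff(countTags(toolA), countTags(toolB))); // prints {body=0, p=1}
    }
}
```

A positive difference for `p` says only that Tool A emitted more paragraph elements; judging which extract is better still requires looking at the content.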
 
= Intended Uses/Scope = 

= How to Count Structural Components =

If you are using Tika to generate .json files, follow the directions on 
TikaEval for how to create a directory of extracts, but don't include the `-t` 
option: `java -jar tika-app.X.Y.jar -J -i input_dir -o extracts`.  This has the 
effect of storing the content that is extracted as XHTML, and it sets a 
metadata value of `ToXMLContentHandler` for the key `X-TIKA:content_handler`.  
When tika-eval finds that value set in the metadata, it parses the XHTML with a 
SAXParser to count the structure tags and extract the text.
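
The counting step can be sketched with the JDK's own SAX API.  The following is a minimal sketch under stated assumptions, not tika-eval's actual code: the class name `StructureTagCounter` and its members are hypothetical, and the input is assumed to be well-formed XHTML of the kind stored in the extract.

```java
import java.io.StringReader;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.XMLConstants;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler for illustration only; not a tika-eval class.
class StructureTagCounter extends DefaultHandler {
    final Map<String, Integer> tagCounts = new TreeMap<>();
    final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        // The default factory is not namespace-aware, so qName holds the tag name.
        tagCounts.merge(qName, 1, Integer::sum);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Collect character data so the text can be evaluated separately from the tags.
        text.append(ch, start, length);
    }

    static StructureTagCounter parse(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        // Enable the JDK's secure-processing limits for the parse.
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        SAXParser parser = factory.newSAXParser();
        StructureTagCounter handler = new StructureTagCounter();
        parser.parse(new InputSource(new StringReader(xhtml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        StructureTagCounter h = parse("<html><body><p>The quick brown fox</p>"
                + "<p>jumped over the lazy dog</p></body></html>");
        System.out.println(h.tagCounts); // prints {body=1, html=1, p=2}
        System.out.println(h.text);
    }
}
```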

= Handling 
