Hi all,

While working on TIKA-1102, I was looking at tika-mimetypes.xml, specifically 
the section for text/html.

I was curious about offsets specified for various HTML elements, e.g.

      <match value="&lt;html" type="string" offset="0:8192"/>
      <match value="&lt;HTML" type="string" offset="0:64"/>
      <match value="&lt;BODY" type="string" offset="0"/>
      <match value="&lt;body" type="string" offset="0"/>
      <match value="&lt;DIV" type="string" offset="0"/>
      <match value="&lt;div" type="string" offset="0"/>
      <match value="&lt;TITLE" type="string" offset="0"/>
      <match value="&lt;title" type="string" offset="0"/>
      <match value="&lt;h1" type="string" offset="0"/>
      <match value="&lt;H1" type="string" offset="0"/>
      <match value="&lt;!doctype HTML" type="string" offset="0"/>
      <match value="&lt;!DOCTYPE html" type="string" offset="0"/>

Why are most set to 0, but some set to a range?

Wouldn't we want all of these to use a (reasonable) range, so that text with 
leading whitespace, for example "   <body>", would still match?
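
To make the concern concrete, here is a rough sketch in Python of how I assume the offset attribute is interpreted: "N" means the pattern must start exactly at byte N, and "N:M" means it may start anywhere from N to M inclusive. This is only an illustration of that assumption, not Tika's actual detector code, and magic_matches is a made-up helper name.

```python
def magic_matches(data: bytes, pattern: bytes, offset: str) -> bool:
    """Return True if `pattern` starts at a position allowed by `offset`.

    `offset` is either "N" (exact position) or "N:M" (inclusive range).
    This mirrors my assumed reading of the XML attribute, nothing more.
    """
    start_text, _, end_text = offset.partition(":")
    start = int(start_text)
    end = int(end_text) if end_text else start
    # The pattern may begin anywhere in [start, end], so the window has to
    # extend len(pattern) bytes past the last allowed start position.
    window = data[start:end + len(pattern)]
    return pattern in window


sample = b"   <body><p>hello</p></body>"
print(magic_matches(sample, b"<body", "0"))       # exact offset 0
print(magic_matches(sample, b"<body", "0:8192"))  # range, as used for <html
```

If that reading is right, the first call fails because of the three leading spaces, while the second succeeds, which is the behavior I'd expect for all of the HTML entries.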

I added an entry for div that used offset="0", so as to follow convention, but 
that seems wrong.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr




