Hi all,
While working on TIKA-1102, I was looking at tika-mimetypes.xml, specifically
the section for text/html.
I was curious about the offsets specified for the various HTML elements, e.g.
<match value="&lt;html" type="string" offset="0:8192"/>
<match value="&lt;HTML" type="string" offset="0:64"/>
<match value="&lt;BODY" type="string" offset="0"/>
<match value="&lt;body" type="string" offset="0"/>
<match value="&lt;DIV" type="string" offset="0"/>
<match value="&lt;div" type="string" offset="0"/>
<match value="&lt;TITLE" type="string" offset="0"/>
<match value="&lt;title" type="string" offset="0"/>
<match value="&lt;h1" type="string" offset="0"/>
<match value="&lt;H1" type="string" offset="0"/>
<match value="&lt;!doctype HTML" type="string" offset="0"/>
<match value="&lt;!DOCTYPE html" type="string" offset="0"/>
Why are most set to 0, but some set to a range?
Wouldn't we want all of these to use a (reasonable) range, so that text with
leading whitespace, such as " <body>", would still match?
I added an entry for div that used offset="0", so as to follow convention, but
that seems wrong.
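
To make the question concrete, here is a rough Python sketch of how I understand
the two offset forms to behave. This is only a simplified model, not Tika's
actual matching code: the function name magic_matches is made up for
illustration, and I am assuming that offset="N:M" means the value may start at
any position from N through M inclusive.

```python
def magic_matches(data: bytes, value: bytes, offset: str) -> bool:
    """Return True if `value` starts at an allowed position in `data`.

    `offset` is either "N" (one exact start position) or "N:M"
    (assumed here to be an inclusive range of start positions).
    """
    start_text, _, end_text = offset.partition(":")
    start = int(start_text)
    end = int(end_text) if end_text else start

    # Never look past the last position where `value` could still fit.
    last_possible = len(data) - len(value)
    for pos in range(start, min(end, last_possible) + 1):
        if data[pos:pos + len(value)] == value:
            return True
    return False


# Input with one leading space before the tag.
sample = b" <body>"
exact = magic_matches(sample, b"<body", "0")      # tag must start at byte 0
ranged = magic_matches(sample, b"<body", "0:64")  # tag may start at bytes 0..64
```

Under that assumption, the offset="0" entry misses the sample because of the
single leading space, while the 0:64 range picks it up.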
-- Ken
--------------------------
Ken Krugler