On Jul 8, 2010, at 12:15am, Andrzej Bialecki wrote:
On 2010-07-07 22:32, Ken Krugler wrote:
Hi Julien,
See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way is handled, we also saw ca
On 2010-07-07 22:32, Ken Krugler wrote:
Hi Julien,
See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way is handled, we also saw cases where it was
twice in the output.
Don't know about th
Hi Ken,
Thank you for your comments and analysis. We should probably modify the
HTMLHandler so that it does not discard a frameset because of the bodylevel
being equal to 0. I suggested earlier on the Tika list having a mechanism
for specifying a custom handler via the Context, that would give us
Hi Julien,
See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something
very wrong with the way is handled, we also saw cases where it
was twice in the output.
Don't know about the case of it appearing twice.
But f
Ken,
See https://issues.apache.org/jira/browse/TIKA-457 for a description of one
of the cases found by Andrzej. There seems to be something very wrong with
the way is handled, we also saw cases where it was twice in the
output.
J.
On 7 July 2010 17:41, Ken Krugler wrote:
> Hi Andrzej,
>
> I've
Hi Andrzej,
I've got an old list of cases where Tika was not extracting links:
- frame
- iframe
- img
- map
- object
- link (only in section)
I worked around this in my crawling code, by directly processing the
DOM, but I should roll this into Tika.
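
For readers following the thread, here is a minimal sketch of the kind of workaround described above: collecting links from the listed elements by walking the parsed HTML directly. It is not Ken's code and does not use Tika; it relies only on Python's standard `html.parser`. The names `LINK_ATTRS`, `LinkCollector` and `extract_links` are hypothetical, and the element-to-attribute mapping is an assumption, since the message names only the elements. The "only in ... section" restriction on `link` is not modelled.

```python
from html.parser import HTMLParser

# Element -> attribute assumed to carry the link. This mapping is an
# illustration only; the original message lists the elements, not attributes.
LINK_ATTRS = {
    "frame": "src",
    "iframe": "src",
    "img": "src",
    "area": "href",    # links inside a <map> sit on its <area> children
    "object": "data",
    "link": "href",
}


class LinkCollector(HTMLParser):
    """Collects (tag, url) pairs for the elements in LINK_ATTRS."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        attr_name = LINK_ATTRS.get(tag)
        if attr_name is None:
            return
        for name, value in attrs:
            if name == attr_name and value:
                self.links.append((tag, value))


def extract_links(html):
    """Return the links found in html, in document order."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Calling `extract_links(page_source)` returns a list of `(tag, url)` pairs, which a crawler could then resolve against the page's base URL. A frameset page yields its `frame` targets here because nothing is discarded based on whether a `body` element was seen.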
If you have a list of problems with
Hi,
I'm going through NUTCH-840, and I tried to eat our own dog food, i.e.
prepare the test DOM-s with Tika's HtmlParser.
Results are not so good for some test cases... Even when using
IdentityHtmlMapper Tika ignores some elements (such as frame/frameset)
and for some others (area) it drops