Re: Parse-tika ignores too much data...

2010-07-08 Thread Andrzej Bialecki

On 2010-07-07 22:32, Ken Krugler wrote:

Hi Julien,


See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something very
wrong with the way body is handled; we also saw cases where it appeared
twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly
broken, in that you can either have a body OR a frameset, but not both.


The HTML was broken on purpose. One of the goals of the original test
was to extract as much content and as many links as possible in the
presence of grave errors. As you know, even major sites often produce
badly broken HTML, but parsers sanitize it and produce a valid DOM.
In this case, the output contained two nested body elements, which is
not valid. I should also mention that NekoHTML handled this test much
better, by removing the body and retaining only the frameset.
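
For illustration only, the two conditions discussed here (nested body elements, and a body combined with a frameset) can be checked on serialized parser output with a few lines of code. This is a minimal sketch using Python's standard html.parser, which reports tags as written and does no repair of its own; it is not Tika, TagSoup or NekoHTML code, and the names BodyStructureCheck and check_structure are hypothetical.

```python
from html.parser import HTMLParser


class BodyStructureCheck(HTMLParser):
    """Records body nesting depth and frameset presence in serialized markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0            # number of currently open <body> elements
        self.max_body_depth = 0   # deepest <body> nesting seen so far
        self.has_frameset = False

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag names before calling this method.
        if tag == "body":
            self.depth += 1
            self.max_body_depth = max(self.max_body_depth, self.depth)
        elif tag == "frameset":
            self.has_frameset = True

    def handle_endtag(self, tag):
        if tag == "body" and self.depth > 0:
            self.depth -= 1


def check_structure(markup):
    """Return flags for the two invalid structures (hypothetical helper)."""
    checker = BodyStructureCheck()
    checker.feed(markup)
    checker.close()
    return {
        "nested_body": checker.max_body_depth > 1,
        "body_and_frameset": checker.max_body_depth > 0 and checker.has_frameset,
    }
```

A test could run such a check over the serialized DOM and fail when either flag is set.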


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Andrzej,

I've got an old list of cases where Tika was not extracting links:

 - frame
 - iframe
 - img
 - map
 - object
 - link (only in head section)

I worked around this in my crawling code, by directly processing the  
DOM, but I should roll this into Tika.
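
As a rough illustration of that kind of workaround (not the actual crawler code, and not Tika's API), link extraction for the elements listed above can be sketched with Python's standard html.parser. The tag-to-attribute table is an assumption: it maps each listed element to the attribute that normally carries its URL, and adds a and area, since a map carries its links on its area children. The names LINK_ATTRS, OutlinkCollector and collect_outlinks are hypothetical.

```python
from html.parser import HTMLParser

# Assumed table of link-bearing elements; not taken from Tika or Nutch.
LINK_ATTRS = {
    "a": "href",
    "area": "href",    # children of <map>
    "frame": "src",
    "iframe": "src",
    "img": "src",
    "link": "href",    # usually found in <head>
    "object": "data",
}


class OutlinkCollector(HTMLParser):
    """Collects (tag, url) pairs from every link-bearing start tag."""

    def __init__(self):
        super().__init__()
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lowercased from html.parser.
        wanted = LINK_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            if name == wanted and value:
                self.outlinks.append((tag, value))


def collect_outlinks(markup):
    """Return outlinks in document order (hypothetical helper)."""
    collector = OutlinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.outlinks
```

Because the collector looks at every start tag regardless of where it occurs, a frame inside a frameset or a link in the head is picked up the same way as an anchor in the body.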


If you have a list of problems with test docs, file a TIKA issue and  
I'll try to fix things up quickly.


Thanks,

-- Ken

On Jul 7, 2010, at 5:55am, Andrzej Bialecki wrote:


Hi,

I'm going through NUTCH-840, and I tried to eat our own dog food,
i.e. prepare the test DOMs with Tika's HtmlParser.


Results are not so good for some test cases... Even when using
IdentityHtmlMapper, Tika ignores some elements (such as
frame/frameset), and for some others (area) it drops the href. As a
result, the number of valid outlinks collected with parse-tika is much
smaller than with parse-html.


I know this issue has been reported (TIKA-379, NUTCH-817,
NUTCH-794), and a partial fix was applied to Tika 0.8, but it still
won't handle the problems I mentioned above.


Can we come up with a plan to address this? I'd rather switch
completely to Tika's HTML parsing, but at the moment we would lose
too much useful data...



Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g






Re: Parse-tika ignores too much data...

2010-07-07 Thread Ken Krugler

Hi Julien,

See https://issues.apache.org/jira/browse/TIKA-457 for a description
of one of the cases found by Andrzej. There seems to be something
very wrong with the way body is handled; we also saw cases where it
appeared twice in the output.


Don't know about the case of it appearing twice.

But for the above issue, I added a comment. The test HTML is badly  
broken, in that you can either have a body OR a frameset, but not  
both.


-- Ken


Re: Parse-tika ignores too much data...

2010-07-07 Thread Julien Nioche
Hi Ken,

Thank you for your comments and analysis. We should probably modify the
HTMLHandler so that it does not discard a frameset because of the bodylevel
being equal to 0. I suggested earlier on the Tika list having a mechanism
for specifying a custom handler via the Context; that would give us the
option in Nutch to implement the logic we want, i.e. ignore the body level
if we want to.
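
To make the idea concrete, here is a minimal sketch of a handler that tracks a body level and drops anything seen while that level is 0, with a switch to ignore the level. It is modelled only on the behaviour described in this message, not on Tika's actual handler code, and it uses Python's standard html.parser rather than SAX. The names BodyLevelFilter, ignore_body_level and kept_tags are hypothetical.

```python
from html.parser import HTMLParser


class BodyLevelFilter(HTMLParser):
    """Keeps start tags seen inside <body>; optionally ignores the body level."""

    def __init__(self, ignore_body_level=False):
        super().__init__()
        self.ignore_body_level = ignore_body_level  # assumed opt-in switch
        self.body_level = 0                         # open <body> elements
        self.kept = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.body_level += 1
            return
        # Default: discard anything outside <body> (body level equal to 0).
        if self.body_level > 0 or self.ignore_body_level:
            self.kept.append(tag)

    def handle_endtag(self, tag):
        if tag == "body" and self.body_level > 0:
            self.body_level -= 1


def kept_tags(markup, ignore_body_level=False):
    """Return the start tags the filter lets through (hypothetical helper)."""
    handler = BodyLevelFilter(ignore_body_level)
    handler.feed(markup)
    handler.close()
    return handler.kept
```

With the default setting a frameset that precedes the body is lost; with ignore_body_level=True it survives, which is the choice a custom handler would let Nutch make for itself.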

Thanks

J.

-- 
DigitalPebble Ltd

Open Source Solutions for Text Engineering
http://www.digitalpebble.com