[ 
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-840:
---------------------------------------
    Attachment: NUTCH-840-2.x.patch

Patch for 2.X.
There currently appears to be a discrepancy in the detection of outlinks: we 
are detecting more than the test expects (5 instead of 3).

{code}
Testsuite: org.apache.nutch.parse.tika.TestDOMContentUtils
Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.496 sec

Testcase: testGetTitle took 0.331 sec
Testcase: testGetText took 0.069 sec
Testcase: testGetOutlinks took 0.08 sec
        FAILED
got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


junit.framework.AssertionFailedError: got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


        at org.apache.nutch.parse.tika.TestDOMContentUtils.compareOutlinks(TestDOMContentUtils.java:315)
        at org.apache.nutch.parse.tika.TestDOMContentUtils.testGetOutlinks(TestDOMContentUtils.java:296)
{code}
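
To make the discrepancy easier to see, here is a minimal standalone sketch that 
diffs the "answer" and "got" lists from the output above. It does not use the 
Nutch API: the OutlinkDiff class, its Link type and the extras() helper are 
hypothetical names introduced only for illustration, and the data is copied 
from the test log.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class OutlinkDiff {
    // Hypothetical stand-in for an outlink (NOT the Nutch Outlink class):
    // a target URL plus its anchor text.
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Link)) return false;
            Link other = (Link) o;
            return toUrl.equals(other.toUrl) && anchor.equals(other.anchor);
        }

        @Override
        public int hashCode() {
            return Objects.hash(toUrl, anchor);
        }

        @Override
        public String toString() {
            return "toUrl: " + toUrl + " anchor: " + anchor;
        }
    }

    // Returns the entries of 'got' that cannot be matched one-to-one
    // against an entry of 'expected'.
    static List<Link> extras(List<Link> expected, List<Link> got) {
        List<Link> remaining = new ArrayList<>(expected);
        List<Link> extra = new ArrayList<>();
        for (Link link : got) {
            if (!remaining.remove(link)) {
                extra.add(link);
            }
        }
        return extra;
    }

    public static void main(String[] args) {
        // Data copied from the "answer" and "got" sections of the log.
        List<Link> expected = Arrays.asList(
            new Link("http://www.nutch.org/", "home"),
            new Link("http://www.nutch.org/docs/1", "1"),
            new Link("http://www.nutch.org/docs/2", "2"));
        List<Link> got = Arrays.asList(
            new Link("http://www.nutch.org/", "home"),
            new Link("http://www.nutch.org/", ""),
            new Link("http://www.nutch.org/docs/1", "1"),
            new Link("http://www.nutch.org/docs/1", ""),
            new Link("http://www.nutch.org/docs/2", "2"));

        for (Link link : extras(expected, got)) {
            System.out.println("unexpected " + link);
        }
    }
}
```

With the data from the log, the two unmatched entries are the ones for 
http://www.nutch.org/ and http://www.nutch.org/docs/1 with an empty anchor, 
i.e. each repeats a URL that is already present with a non-empty anchor.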

> Port tests from parse-html to parse-tika
> ----------------------------------------
>
>                 Key: NUTCH-840
>                 URL: https://issues.apache.org/jira/browse/NUTCH-840
>             Project: Nutch
>          Issue Type: Task
>          Components: parser
>    Affects Versions: 1.1, 1.6
>            Reporter: Julien Nioche
>             Fix For: 2.4
>
>         Attachments: NUTCH-840-2.x.patch, NUTCH-840-trunk.patch, 
> NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch
>
>
> We don't have tests for HTML in parse-tika, so I'll copy them from the old 
> parse-html plugin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
