[
https://issues.apache.org/jira/browse/NUTCH-840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Lewis John McGibbney updated NUTCH-840:
---------------------------------------
Attachment: NUTCH-840-2.x.patch
Patch for 2.X.
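There currently appears to be a discrepancy in the detection of outlinks. We
are detecting more than the test expects (5 instead of 3); the two extra
outlinks repeat URLs that were already found, but with an empty anchor: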
{code}
Testsuite: org.apache.nutch.parse.tika.TestDOMContentUtils
Tests run: 3, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 0.496 sec

Testcase: testGetTitle took 0.331 sec
Testcase: testGetText took 0.069 sec
Testcase: testGetOutlinks took 0.08 sec
    FAILED
got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


junit.framework.AssertionFailedError: got wrong number of outlinks (expecting 3, got 5)
answer:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/2 anchor: 2

got:
toUrl: http://www.nutch.org/ anchor: home
toUrl: http://www.nutch.org/ anchor:
toUrl: http://www.nutch.org/docs/1 anchor: 1
toUrl: http://www.nutch.org/docs/1 anchor:
toUrl: http://www.nutch.org/docs/2 anchor: 2


    at org.apache.nutch.parse.tika.TestDOMContentUtils.compareOutlinks(TestDOMContentUtils.java:315)
    at org.apache.nutch.parse.tika.TestDOMContentUtils.testGetOutlinks(TestDOMContentUtils.java:296)
{code}
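To make the mismatch easier to see in isolation, here is a minimal standalone sketch that diffs the expected and actual outlink lists from the log above. It does not use any Nutch classes: OutlinkDiff, Link and unexpected() are hypothetical names made up for this illustration, and the multiset-style comparison is an assumption, not necessarily how compareOutlinks works.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class OutlinkDiff {

    /** Hypothetical stand-in for an outlink (NOT the Nutch Outlink class). */
    public static final class Link {
        public final String toUrl;
        public final String anchor;

        public Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Link)) return false;
            Link other = (Link) o;
            return toUrl.equals(other.toUrl) && anchor.equals(other.anchor);
        }

        @Override
        public int hashCode() {
            return Objects.hash(toUrl, anchor);
        }

        @Override
        public String toString() {
            return "toUrl: " + toUrl + " anchor: " + anchor;
        }
    }

    /** Returns the entries of 'got' that have no remaining match in 'expected'. */
    public static List<Link> unexpected(List<Link> expected, List<Link> got) {
        List<Link> remaining = new ArrayList<>(expected);
        List<Link> extras = new ArrayList<>();
        for (Link link : got) {
            // remove(Object) drops the first equal element, so each expected
            // link can be matched at most once
            if (!remaining.remove(link)) {
                extras.add(link);
            }
        }
        return extras;
    }

    public static void main(String[] args) {
        // Values copied from the "answer" and "got" sections of the test log
        List<Link> expected = Arrays.asList(
            new Link("http://www.nutch.org/", "home"),
            new Link("http://www.nutch.org/docs/1", "1"),
            new Link("http://www.nutch.org/docs/2", "2"));
        List<Link> got = Arrays.asList(
            new Link("http://www.nutch.org/", "home"),
            new Link("http://www.nutch.org/", ""),
            new Link("http://www.nutch.org/docs/1", "1"),
            new Link("http://www.nutch.org/docs/1", ""),
            new Link("http://www.nutch.org/docs/2", "2"));

        for (Link extra : unexpected(expected, got)) {
            System.out.println("unexpected -> " + extra);
        }
    }
}
```

Run against the values from the log, this flags only the two entries with an empty anchor, which suggests the extra outlinks come from a second pass over the same links rather than from genuinely new URLs.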
> Port tests from parse-html to parse-tika
> ----------------------------------------
>
> Key: NUTCH-840
> URL: https://issues.apache.org/jira/browse/NUTCH-840
> Project: Nutch
> Issue Type: Task
> Components: parser
> Affects Versions: 1.1, 1.6
> Reporter: Julien Nioche
> Fix For: 2.4
>
> Attachments: NUTCH-840-2.x.patch, NUTCH-840-trunk.patch,
> NUTCH-840.patch, NUTCH-840.patch, NUTCH-840v2.patch
>
>
> We don't have tests for HTML in parse-tika, so I'll copy them from the old
> parse-html plugin.