Hi,
We use Nutch 1.0.
I found that for certain web pages, e.g.
http://www.funnycorner.net/funny-pictures/4060/funny-people-pictures/thieves-snort-dogs-ashes-in-cocaine-bungle.html,
the text in org.apache.nutch.parse.ParseText contains newlines (see the sample below).

"Forums: Sites: Share This Funny Picture
on : <a href="
http://www.funnycorner.net/funny-pictures/4060/funny-people-pictures/thieves-snort-dogs-ashes-in
-cocaine-bungle.html" title="Thieves Snort Dogs Ashes In Cocaine Bungle"
target="_blank">"

Our downstream parsing utility assumes that the parse text is a single line.
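
As a possible stopgap on our side, here is a minimal sketch of collapsing the text to one line before it reaches the downstream utility. It assumes the text has already been pulled out of ParseText as a plain String (e.g. via getText()); the class and method names below are hypothetical and not part of Nutch.

```java
public class SingleLineText {

    // Hypothetical helper, not a Nutch API: replaces every run of
    // whitespace (spaces, tabs, \r, \n) with a single space and trims
    // the ends, so the result never contains a line break.
    public static String toSingleLine(String text) {
        if (text == null) {
            return "";
        }
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up input that mimics the shape of the sample above.
        String sample = "Share This Funny Picture\non : <a href=\"\nhttp://example.com/page.html\"";
        System.out.println(toSingleLine(sample));
    }
}
```

Note that a newline falling in the middle of a token (such as a URL) becomes a space, so this only guarantees a single line, not that the original tokens are restored.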

Is there a JIRA issue that addresses this?

Thanks
