Hi Ted,
This is a Tika issue, and one that's been on my list for a while to
file/fix - thanks for the reminder :)
-- Ken
On Feb 18, 2010, at 4:31pm, Ted Yu wrote:
Hi,
We use Nutch 1.0.
I found that for certain web pages, e.g.
http://www.funnycorner.net/funny-pictures/4060/funny-people-pictures/thieves-snort-dogs-ashes-in-cocaine-bungle.html
the text in org.apache.nutch.parse.ParseText contains newlines - see the sample below.
"Forums: Sites: Share This Funny Picture
on : <a href="
http://www.funnycorner.net/funny-pictures/4060/funny-people-pictures/thieves-snort-dogs-ashes-in
-cocaine-bungle.html" title="Thieves Snort Dogs Ashes In Cocaine
Bungle"
target="_blank">"
Our downstream parsing utility assumes that the parse text is a single
line.
Is there a JIRA issue that is going to fix this?
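
Until this is fixed upstream, one possible stopgap on the consuming side is to collapse whitespace before handing the text to the single-line parser. Below is a minimal sketch in plain Java, assuming the text has already been pulled out of ParseText as a String. The class and method names (SingleLineText, toSingleLine) are made up for illustration and are not part of Nutch or Tika.

```java
public final class SingleLineText {
    private SingleLineText() {}

    /**
     * Hypothetical helper: replaces every run of whitespace
     * (spaces, tabs, \r, \n) with a single space and trims the ends,
     * so the result never contains a line break.
     */
    public static String toSingleLine(String text) {
        if (text == null) {
            return "";
        }
        return text.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Shortened stand-in for parse text with embedded newlines.
        String sample = "Share This Funny Picture\non : <a href=\"\nhttp://example.com/page.html\"";
        System.out.println(toSingleLine(sample));
    }
}
```

This only hides the symptom: a line break that falls inside a URL or a word becomes a space, so the flattened text can still differ from what the page author wrote.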
Thanks
--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g