Hi Ted,

This is a Tika issue, and one that's been on my list for a while to file/fix - thanks for the reminder :)

-- Ken

On Feb 18, 2010, at 4:31pm, Ted Yu wrote:

Hi,
We use Nutch 1.0.
I found that for certain web pages, e.g.
http://www.funnycorner.net/funny-pictures/4060/funny-people-pictures/thieves-snort-dogs-ashes-in-cocaine-bungle.html
the text in org.apache.nutch.parse.ParseText contains newlines; see the sample below.

"Forums: Sites: Share This Funny Picture
on : <a href="
http://www.funnycorner.net/funny-pictures/4060/funny-people-pictures/thieves-snort-dogs-ashes-in
-cocaine-bungle.html" title="Thieves Snort Dogs Ashes In Cocaine Bungle"
target="_blank">"

Our downstream parsing utility assumes that parse text is a single line.
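
As a stopgap on the consumer side, line breaks could be collapsed before the text reaches that utility. This is only a minimal sketch, assuming the parse text has already been obtained as a plain String; the class and method names are hypothetical and not part of Nutch or Tika.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch or Tika.
public class ParseTextFlattener {
    // Matches any run of whitespace, including \n, \r and tabs.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /** Collapses every whitespace run to one space so the result is a single line. */
    public static String flatten(String parseText) {
        if (parseText == null) {
            return "";
        }
        return WHITESPACE_RUN.matcher(parseText).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample with embedded newlines, loosely modelled on the one above.
        String sample = "Share This Funny Picture\non : <a href=\"\nhttp://example.com/page.html\"";
        System.out.println(flatten(sample));
    }
}
```

This only replaces line breaks with spaces. A URL that was split in the middle stays split (by a space instead of a newline), so it works around the single-line assumption without repairing the extracted text itself.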

Is there a JIRA issue that tracks a fix for this?

Thanks

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g



