ContentLength not trimmed
-------------------------
Key: NUTCH-1010
URL: https://issues.apache.org/jira/browse/NUTCH-1010
Project: Nutch
Issue Type: Bug
Components: indexer
Reporter: Markus Jelsma
Some component does not trim the ContentLength field. As a result, the indexer
treats a seemingly numeric field as a string whenever the value has leading or
trailing whitespace. The result is a hard-to-debug exception that gives no way
to identify the bad document (among thousands) or the bad field.
{code}
Jun 22, 2011 1:03:42 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NumberFormatException: For input string: "32717 "
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
at java.lang.Long.parseLong(Long.java:419)
at java.lang.Long.parseLong(Long.java:468)
{code}
This can be fixed quickly in the index-more plugin by calling trim() on the
value when adding the field.
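
A minimal sketch of the idea, not the actual index-more code: the class and
the normalize() helper are hypothetical names used only for illustration. It
reproduces the parse failure from the log above with the untrimmed value and
shows that the trimmed value parses.

```java
class ContentLengthTrimSketch {

    // Hypothetical helper (not part of Nutch): strips leading and trailing
    // whitespace from a raw Content-Length value, passing null through.
    static String normalize(String raw) {
        return raw == null ? null : raw.trim();
    }

    public static void main(String[] args) {
        String raw = "32717 "; // value taken from the stack trace above

        try {
            // Long.parseLong rejects surrounding whitespace.
            Long.parseLong(raw);
        } catch (NumberFormatException e) {
            System.out.println("untrimmed value rejected: " + e.getMessage());
        }

        // After trimming, the same value parses as a long.
        long length = Long.parseLong(normalize(raw));
        System.out.println("trimmed value parsed: " + length);
    }
}
```

Trimming where the field is added keeps the stored value clean, so consumers
that parse it as a number do not each need to handle whitespace themselves.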