ContentLength not trimmed
-------------------------

                 Key: NUTCH-1010
                 URL: https://issues.apache.org/jira/browse/NUTCH-1010
             Project: Nutch
          Issue Type: Bug
          Components: indexer
            Reporter: Markus Jelsma


Some component does not trim the ContentLength field. As a result, a value that 
should be numeric reaches the indexer as an unparseable string whenever it 
carries leading or trailing whitespace. The result is a hard-to-debug exception 
that gives no way to identify the bad document (among thousands) or the bad 
field.

{code}
Jun 22, 2011 1:03:42 PM org.apache.solr.common.SolrException log
SEVERE: java.lang.NumberFormatException: For input string: "32717     "
        at java.lang.NumberFormatException.forInputString(NumberFormatException.java:48)
        at java.lang.Long.parseLong(Long.java:419)
        at java.lang.Long.parseLong(Long.java:468)
{code}

This can be quickly fixed in the index-more plugin by calling trim() on the 
value when adding the field.
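
A minimal sketch of the failure and the proposed fix, assuming the value arrives
as a plain String. The class and the normalizeContentLength helper are
hypothetical names used for illustration only; they are not taken from the
index-more plugin source.

```java
public class ContentLengthTrim {

    // Hypothetical helper (not part of Nutch): strips leading and trailing
    // whitespace so the value can be parsed as a number downstream.
    static String normalizeContentLength(String raw) {
        return raw == null ? null : raw.trim();
    }

    public static void main(String[] args) {
        // Same value as in the stack trace above, trailing spaces included.
        String raw = "32717     ";

        // Long.parseLong does not tolerate whitespace, so this throws.
        try {
            Long.parseLong(raw);
        } catch (NumberFormatException e) {
            System.out.println("untrimmed value rejected: " + e.getMessage());
        }

        // After trimming, the same value parses cleanly.
        long length = Long.parseLong(normalizeContentLength(raw));
        System.out.println("trimmed value parsed: " + length);
    }
}
```

Trimming at the point where the field is added keeps the change local to one
plugin, although it does not reveal which component introduces the whitespace.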

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
