Koen Smets created NUTCH-1723:
---------------------------------

             Summary: nutch updatedb fails due to avro (de)serialization issues on images
                 Key: NUTCH-1723
                 URL: https://issues.apache.org/jira/browse/NUTCH-1723
             Project: Nutch
          Issue Type: Bug
          Components: crawldb, parser
    Affects Versions: 2.2.1, 2.3
         Environment: - Ubuntu 12.04.3 LTS (GNU/Linux 3.2.0-36-generic x86_64)
- DataStax Community Edition Apache Cassandra 2.0.4

            Reporter: Koen Smets
             Fix For: 2.3


Running `bin/crawl` for 2 iterations, using either the nutch-2.2.1 release or 
the latest 2.x checkout, on a seed file containing for example 
http://www.mountsinai.on.ca and http://www.dhzb.de (or any other website that 
serves image files without an obvious file extension) throws 
java.lang.IllegalArgumentException, IOException and/or 
IndexOutOfBoundsException in the readFields function of WebPageWritable:

  @Override
  public void readFields(DataInput in) throws IOException {
    webPage = IOUtils.deserialize(getConf(), in, webPage, WebPage.class);
  }

  @Override
  public void write(DataOutput out) throws IOException {
    IOUtils.serialize(getConf(), out, webPage, WebPage.class);
  }

2014-02-04 13:50:15,421 INFO  util.WebPageWritable - Try reading fields: ...
2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Failed to read fields: http://www.mountsinai.on.ca/carousel/patient-care-banner/image
2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Reading fields of the WebPage class failed - java.lang.IllegalArgumentException
2014-02-04 13:50:15,425 ERROR util.WebPageWritable - Error - Printing stacktrace - java.lang.IllegalArgumentException

Or, 
java.lang.IndexOutOfBoundsException
        at java.nio.Buffer.checkBounds(Buffer.java:559)
        at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:143)
        at org.apache.avro.ipc.ByteBufferInputStream.read(ByteBufferInputStream.java:52)
        at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:183)
        at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265)
        at org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131)
        at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:280)
        at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:191)
        at org.apache.gora.avro.PersistentDatumReader.readMap(PersistentDatumReader.java:183)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
        at org.apache.gora.avro.PersistentDatumReader.readRecord(PersistentDatumReader.java:139)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:80)
        at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:103)
        at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:98)
        at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:73)
        at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:36)
        at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:205)
        at org.apache.nutch.util.WebPageWritable.readFields(WebPageWritable.java:45)
        at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
        at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)


The exceptions are caused by image files that slip through the urlfilter (no 
extension indicating an image file) and that get (properly?) parsed by the 
Tika library.
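
To illustrate why such URLs pass a suffix-based filter, here is a minimal 
sketch. The suffix pattern, the class and method names, and the 
Content-Type check are assumptions made for illustration; this is not the 
rule set that ships with Nutch. It only shows that a URL ending in /image 
has no extension to match on, whereas a check on the reported media type 
(assumed here to be image/jpeg) would still recognize it.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Illustrative only: the pattern below is an assumed suffix rule,
// not the filter configuration shipped with Nutch.
class SuffixFilterSketch {
    private static final Pattern IMAGE_SUFFIX =
        Pattern.compile("\\.(gif|jpg|jpeg|png|bmp|ico)$", Pattern.CASE_INSENSITIVE);

    /** Returns true when the URL is NOT rejected by the suffix-only rule. */
    static boolean passesSuffixFilter(String url) {
        return !IMAGE_SUFFIX.matcher(url).find();
    }

    /** Returns true when a Content-Type header value denotes an image. */
    static boolean isImageContentType(String contentType) {
        return contentType != null
            && contentType.trim().toLowerCase(Locale.ROOT).startsWith("image/");
    }

    public static void main(String[] args) {
        // URL taken from the log output in this report.
        String url = "http://www.mountsinai.on.ca/carousel/patient-care-banner/image";
        System.out.println("passes suffix filter: " + passesSuffixFilter(url));
        // Hypothetical URL with an extension, rejected by the suffix rule.
        System.out.println("passes suffix filter: "
            + passesSuffixFilter("http://example.org/logo.png"));
        // Assumed header value; the real server response is not shown here.
        System.out.println("is image by type: " + isImageContentType("image/jpeg"));
    }
}
```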

Note that silently catching the thrown exceptions corrupts the Cassandra 
database, because the deserializer then reads across multiple webpage entries 
in the DataInput. This results in the loss of several pages from other hosts 
present in the seed file.
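
The effect can be reproduced with a toy stream. The sketch below is not the 
Avro, Gora or Hadoop wire format; the record layout, the simulated failure 
and all names are assumptions chosen only to show the mechanism. When 
records are written back to back and a decoder fails partway through one 
of them, the stream is left in the middle of that record, so swallowing 
the exception makes the next read start at the wrong offset. 
Length-prefixing each record is one possible way to keep the reader 
aligned; whether that fits here is a separate question.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Toy model only: not the actual serialization used by Nutch or Gora.
class FramedRecordsSketch {
    /** Toy record: a URL followed by a body, each written with writeUTF. */
    static void encode(DataOutputStream out, String url, String body) throws IOException {
        out.writeUTF(url);
        out.writeUTF(body);
    }

    /** Toy decoder that fails after reading the URL, leaving the body unread. */
    static String decode(DataInputStream in) throws IOException {
        String url = in.readUTF();
        if (url.endsWith("/image")) {
            throw new IOException("simulated deserialization failure: " + url);
        }
        in.readUTF(); // consume the body
        return url;
    }

    static byte[] write(String[][] pages, boolean framed) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String[] page : pages) {
            if (framed) {
                // Serialize into a scratch buffer, then prefix it with its length.
                ByteArrayOutputStream rec = new ByteArrayOutputStream();
                encode(new DataOutputStream(rec), page[0], page[1]);
                out.writeInt(rec.size());
                rec.writeTo(out);
            } else {
                encode(out, page[0], page[1]);
            }
        }
        return bytes.toByteArray();
    }

    static List<String> read(byte[] data, int count, boolean framed) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            try {
                if (framed) {
                    // Consume the whole frame first; a decode failure cannot desync the stream.
                    byte[] rec = new byte[in.readInt()];
                    in.readFully(rec);
                    urls.add(decode(new DataInputStream(new ByteArrayInputStream(rec))));
                } else {
                    urls.add(decode(in));
                }
            } catch (IOException e) {
                // Silently swallowed, as in the scenario described above.
            }
        }
        return urls;
    }

    public static void main(String[] args) throws IOException {
        String[][] pages = {
            {"http://a.example/", "<html>a</html>"},
            {"http://b.example/banner/image", "not-really-html"},
            {"http://c.example/", "<html>c</html>"},
        };
        System.out.println("unframed: " + read(write(pages, false), pages.length, false));
        System.out.println("framed:   " + read(write(pages, true), pages.length, true));
    }
}
```

In the unframed run the body of the failed record is returned as if it were 
a URL and the third page is lost, while the framed run returns the first and 
third URLs intact.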

Moreover, if one makes sure that the image pages don't end up in the DataInput 
written by DBUpdateMapper, e.g. by configuring nutch-site.xml to disable the 
tika parser, nutch updatedb finishes properly.

<property>
  <name>plugin.excludes</name>
  <value>parse-tika</value>
</property>

I highly suspect that the issues are due to gora's dependency on the outdated 
avro-1.3.3 library.




--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
