[
https://issues.apache.org/jira/browse/NUTCH-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14194768#comment-14194768
]
Renato Javier Marroquín Mogrovejo commented on NUTCH-1791:
----------------------------------------------------------
Hey [~lewismc], this is the data evolution problem we have been discussing
lately. The main problem I see here is that we are also making Nutch change the
data schema that it uses. I mean, if it uses field A with type AA, and then we
decide to write A with type BB, then of course this kind of problem will arise.
Gora now allows the reader's schema view, i.e. it tries to read what you tell
it to read, even though the stored data might have some other type. So one
solution is to use an older schema (which will compile to a correct data bean),
and the other one (a union-specific solution) is to "try" to deserialize with
the other types of the union. But this could lead to bad results as well: the
union field might declare types [null, string] while the data was actually
written as integers. Gora enforces the reader's schema view, but we need a way
to support the writer's schema perspective as well.
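To make the union case concrete, here is a minimal sketch of what can happen
when a value written as a plain integer is read back through a [null, string]
union. It assumes Avro's binary encoding rules (zig-zag varints for int/long, a
union written as a branch index followed by the value, a string written as a
length followed by UTF-8 bytes). The encoder and decoder are hand-rolled for
illustration; the class and method names are hypothetical, and this is neither
Gora's nor gora-cassandra's actual serialization code.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class UnionMisreadSketch {
    // Zig-zag varint encoding, as Avro's binary format uses for int and long.
    static byte[] encodeLong(long n) {
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    // Writer using the union schema [null, string]: branch index, then value.
    static byte[] writeNullableString(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] branch = encodeLong(s == null ? 0 : 1);
        out.write(branch, 0, branch.length);
        if (s != null) {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            byte[] len = encodeLong(utf8.length);
            out.write(len, 0, len.length);
            out.write(utf8, 0, utf8.length);
        }
        return out.toByteArray();
    }

    // Small cursor so that decodeLong can advance through the buffer.
    static final class Cursor {
        final byte[] buf;
        int pos;
        Cursor(byte[] buf) { this.buf = buf; }
    }

    static long decodeLong(Cursor c) {
        long z = 0;
        int shift = 0;
        while (true) {
            if (c.pos >= c.buf.length) {
                throw new IllegalStateException("unexpected end of data");
            }
            int b = c.buf[c.pos++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            if ((b & 0x80) == 0) break;
            shift += 7;
        }
        return (z >>> 1) ^ -(z & 1);
    }

    // Reader that only knows its own schema: the union [null, string].
    static String readNullableString(byte[] data) {
        Cursor c = new Cursor(data);
        long branch = decodeLong(c);
        if (branch == 0) return null;
        if (branch == 1) {
            long len = decodeLong(c);
            if (len < 0 || len > c.buf.length - c.pos) {
                throw new IllegalStateException("bad string length " + len);
            }
            return new String(c.buf, c.pos, (int) len, StandardCharsets.UTF_8);
        }
        throw new IllegalStateException("unknown union branch " + branch);
    }

    static String describe(byte[] data) {
        try {
            return String.valueOf(readNullableString(data));
        } catch (IllegalStateException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Matching writer and reader schemas round-trip as expected.
        System.out.println("union \"hi\" -> " + describe(writeNullableString("hi")));
        // Data written as a plain int, then read through the union schema.
        for (int v : new int[] {0, 1, 7}) {
            System.out.println("plain int " + v + " -> " + describe(encodeLong(v)));
        }
    }
}
```

Under these assumptions the integer 0 comes back silently as null, 1 is taken
for the string branch and then runs out of bytes, and 7 is rejected as an
unknown branch. The first case is the worrying one, because nothing fails and
the reader just gets a wrong value, which is why the writer's schema has to be
known somewhere.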
> Null pointer exceptions with gora-cassandra-0.4
> -----------------------------------------------
>
> Key: NUTCH-1791
> URL: https://issues.apache.org/jira/browse/NUTCH-1791
> Project: Nutch
> Issue Type: Bug
> Components: generator, storage
> Affects Versions: 2.3
> Environment: dsc-cassandra-2.0.2, dsc-cassandra-2.0.7
> Reporter: Koen Smets
> Fix For: 2.4
>
>
> The latest nutch-2.x source checkout fails to run with Cassandra 2.0.2 (and
> also Cassandra 2.0.7) as the storage backend, both in the normal Nutch cycle
> (inject, generate, fetch) and in the JUnit test {{TestGoraStorage}}:
> {code}
> 2014-06-03 11:24:23,495 INFO connection.CassandraHostRetryService (CassandraHostRetryService.java:<init>(48)) - Downed Host Retry service started with queue size -1 and retry delay 10s
> 2014-06-03 11:24:23,535 INFO service.JmxMonitor (JmxMonitor.java:registerMonitor(52)) - Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
> Exception in thread "main" java.lang.NullPointerException
> at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
> at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
> at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
> at org.apache.nutch.storage.TestGoraStorage.readWrite(TestGoraStorage.java:93)
> at org.apache.nutch.storage.TestGoraStorage.main(TestGoraStorage.java:230)
> {code}
> After injecting:
> {code}
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch inject urls
> InjectorJob: starting at 2014-06-03 11:55:11
> InjectorJob: Injecting urlDir: urls
> InjectorJob: Using class org.apache.gora.cassandra.store.CassandraStore as the Gora storage class.
> InjectorJob: total number of urls rejected by filters: 0
> InjectorJob: total number of urls injected after normalization and filtering: 1
> Injector: finished at 2014-06-03 11:55:13, elapsed: 00:00:02
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
> WebTable statistics start
> Statistics for WebTable:
> min score: 1.0
> retry 0: 1
> jobs: {db_stats-job_local1403358409_0001={jobID=job_local1403358409_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=97, MAP_INPUT_RECORDS=1, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=12, MAP_OUTPUT_BYTES=53, COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=4, REDUCE_INPUT_RECORDS=6, REDUCE_INPUT_GROUPS=6, COMBINE_OUTPUT_RECORDS=6, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=6, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=4}, FileSystemCounters={FILE_BYTES_READ=974145, FILE_BYTES_WRITTEN=1144369}, File Output Format Counters ={BYTES_WRITTEN=225}}}}
> max score: 1.0
> TOTAL urls: 1
> status 0 (null): 1
> avg score: 1.0
> WebTable statistics: done
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
> key: http://example.com/
> baseUrl: null
> status: 0 (null)
> fetchTime: 1401789311270
> prevFetchTime: 0
> fetchInterval: 2592000
> retriesSinceFetch: 0
> modifiedTime: 0
> prevModifiedTime: 0
> protocolStatus: (null)
> parseStatus: (null)
> title: null
> score: 1.0
> markers: org.apache.gora.persistency.impl.DirtyMapWrapper@eb173c
> reprUrl: null
> metadata _csh_ : ?�
> {code}
> After generating:
> {code}
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch generate -topN 1
> GeneratorJob: starting at 2014-06-03 11:55:38
> GeneratorJob: Selecting best-scoring urls due for fetch.
> GeneratorJob: starting
> GeneratorJob: filtering: true
> GeneratorJob: normalizing: true
> GeneratorJob: topN: 1
> GeneratorJob: finished at 2014-06-03 11:55:40, time elapsed: 00:00:02
> GeneratorJob: generated batch id: 1401789338-222512082 containing 1 URLs
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -stats
> WebTable statistics start
> Statistics for WebTable:
> jobs: {db_stats-job_local73029265_0001={jobID=job_local73029265_0001, jobName=db_stats, counters={File Input Format Counters ={BYTES_READ=0}, Map-Reduce Framework={MAP_OUTPUT_MATERIALIZED_BYTES=6, MAP_INPUT_RECORDS=0, REDUCE_SHUFFLE_BYTES=0, SPILLED_RECORDS=0, MAP_OUTPUT_BYTES=0, COMMITTED_HEAP_BYTES=358612992, CPU_MILLISECONDS=0, SPLIT_RAW_BYTES=769, COMBINE_INPUT_RECORDS=0, REDUCE_INPUT_RECORDS=0, REDUCE_INPUT_GROUPS=0, COMBINE_OUTPUT_RECORDS=0, PHYSICAL_MEMORY_BYTES=0, REDUCE_OUTPUT_RECORDS=0, VIRTUAL_MEMORY_BYTES=0, MAP_OUTPUT_RECORDS=0}, FileSystemCounters={FILE_BYTES_READ=974054, FILE_BYTES_WRITTEN=1144028}, File Output Format Counters ={BYTES_WRITTEN=98}}}}
> TOTAL urls: 0
> WebTable statistics: done
> ksmets@precise64 ~/l/a/r/local> ./bin/nutch readdb -url http://example.com/
> WebTableReader: java.lang.NullPointerException
> at org.apache.gora.cassandra.query.CassandraResult.updatePersistent(CassandraResult.java:121)
> at org.apache.gora.cassandra.query.CassandraResult.nextInner(CassandraResult.java:57)
> at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
> at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:238)
> at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:494)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:430)
> {code}