Hi,

I submitted the patch for peer review by just attaching it to the
issue: https://issues.apache.org/jira/browse/GORA-22

See this article about concurrency and HashMap for background on the topic:
http://www.ibm.com/developerworks/java/library/j-jtp07233/index.html

I ended up calling toArray on the key set to get around the
ConcurrentModificationException that java.util.HashMap throws by
default when the map is modified while its keys are being iterated.
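
A minimal sketch of that workaround, assuming a plain HashMap as the
write buffer. SnapshotFlush and flush are hypothetical names used only
for illustration; this is not the code from the GORA-22 patch.

```java
import java.util.HashMap;
import java.util.Map;

class SnapshotFlush {

    // Hypothetical flush: walks a snapshot of the keys instead of the
    // live key-set view, so changing the map inside the loop does not
    // invalidate the iteration.
    static int flush(Map<String, String> buffer) {
        // toArray copies the current keys into an independent array.
        Object[] keys = buffer.keySet().toArray();
        int flushed = 0;
        for (Object key : keys) {
            // Removing entries while iterating buffer.keySet() directly
            // would fail with ConcurrentModificationException on the
            // next step; iterating the array copy is unaffected.
            buffer.remove(key);
            flushed++;
        }
        return flushed;
    }

    public static void main(String[] args) {
        Map<String, String> buffer = new HashMap<>();
        buffer.put("a", "1");
        buffer.put("b", "2");
        buffer.put("c", "3");
        int flushed = flush(buffer);
        System.out.println("flushed " + flushed + " keys, " + buffer.size() + " left");
    }
}
```

The snapshot only avoids iterating the live view. java.util.HashMap
itself is still not thread-safe, so puts from other threads would still
need external synchronization or a concurrent map.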

I also encountered Cassandra crashes and Hector exceptions (usually
because of GC triggered by the Cassandra daemon?) on my poor 5-year-old
laptop while running the Nutch parse command, which is very CPU and IO
intensive. What worked for me, in mapred-site.xml (see attached config),
was to keep the read batch reasonable (400 rows at a time) and to give
the write batch a different size (for example 843 rows written per
batch) so that reads and writes don't happen simultaneously.


Alexis

On Tue, Aug 30, 2011 at 1:24 AM, Alexis <alexis.detregl...@gmail.com> wrote:
> Hi Tom,
>
> Thanks for testing Nutch 2.0 & Cassandra and reporting the obvious
> bug. I must say development and testing on Gora & Nutch are not very
> active, but at least there is some.
>
>
> 1. As regards your ConcurrentModification issue, it looks like it
> happens when flushing the store. From your exception stacktrace:
> (Line 192 in org.apache.gora.cassandra.store.CassandraStore)
>    for (K key: this.buffer.keySet()) {
>
> while there are other threads adding new keys to the HashMap:
>
> (Line 266)
>    this.buffer.put(key, p);
>
> "it is not generally permissible for one thread to modify a Collection
> while another thread is iterating over it."
>
> Let me try to reproduce the bug and fix it with this in mind:
> How about introducing some mutex / lock mechanism with
> java.util.concurrent.locks.Lock or, easier, using a thread-safe
> implementation such as java.util.concurrent.ConcurrentHashMap?
>
>
> 2. Regarding the OutOfMemory error, maybe decreasing the flushing
> frequency as described here?
> http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency
>
> I like to use the jvisualvm utility from the JDK that monitors the
> memory usage and tells you how this evolves during the execution of
> the class...
>
> Alexis
>
> On Mon, Aug 29, 2011 at 1:50 PM, Tom Davidson <tdavid...@covario.com> wrote:
>> Hi Lewis,
>>
>> I was running Nutch deployed with a dedicated Cassandra cluster. Frankly, I 
>> have given up on using Nutch 2 at this time as it seems highly unstable and 
>> not really in active development. Your effort to address this is 
>> encouraging. Because Nutch uses multithreading in the fetchers, I was 
>> getting ConcurrentModification errors and OutOfMemory errors on a regular 
>> basis in the CassandraStore. As far as I recall, the caching/flushing 
>> implementation is just not thread safe. If the CassandraStore caching was 
>> completely removed it may work, but would probably not be very efficient.  
>> If I were to fix this class, I would try to rewrite it to use Hector batched 
>> mutations instead.
>>
>> Tom
>>
>> -----Original Message-----
>> From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
>> Sent: Monday, August 29, 2011 1:41 PM
>> To: gora-dev@incubator.apache.org; d...@nutch.apache.org
>> Subject: Re: Gora CassandraStore is not thread safe?
>>
>> Hi Tom,
>>
>> Apologies for cross posting, this would not usually be the case but I'm
>> hoping that if any results come from the thread then both communities can
>> benefit.
>>
>> I'm in the process of getting Cassandra 0.8.4 working with Nutch 2.0 and
>> Gora 0.2 myself and seem to be having some nasty problems.
>>
>> Some questions for you
>>
>> 1) How are you running Nutch, local or deploy?
>> 2) How are you running Cassandra, local or deployed in a cluster?
>>
>> The obvious thoughts are that this is a bug and that there are
>> method(s)/object(s) which are not safe.
>>
>> Have you gotten any further with this?
>>
>> Lewis
>>
>>
>> On Wed, Aug 10, 2011 at 8:43 PM, Tom Davidson <tdavid...@covario.com> wrote:
>>
>>> Has anyone tested the CassandraStore in gora 0.2 using multiple threads?
>>>  The nutch 2 fetcher architecture has many threads writing to one
>>> GoraRecordWriter and I am getting concurrent modification errors like below.
>>>
>>> Caused by: java.util.ConcurrentModificationException
>>>               at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
>>>               at java.util.HashMap$KeyIterator.next(HashMap.java:828)
>>>               at
>>> org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:192)
>>>               at
>>> org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>> --
>> *Lewis*
>>
>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
 <property>
  <name>gora.buffer.read.limit</name>
  <value>400</value>
 </property> 
 <property>
  <name>gora.buffer.write.limit</name>
  <value>843</value>
 </property>
</configuration>
