The latest revision, 1177960, should now fix the thread-safety issue: http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java?r1=1177960&r2=1177959&pathrev=1177960
Please comment on https://issues.apache.org/jira/browse/GORA-22 if there is anything else.

Alexis

On Sun, Sep 4, 2011 at 10:43 AM, Alexis <alexis.detregl...@gmail.com> wrote:
> Hi,
>
> I submitted the patch for peer review by just attaching it to the
> issue: https://issues.apache.org/jira/browse/GORA-22
>
> See this article about concurrency and HashMap to read about the topic:
> http://www.ibm.com/developerworks/java/library/j-jtp07233/index.html
>
> I ended up calling toArray over the key set to get around the
> ConcurrentModificationException thrown by default with
> java.util.HashMap when iterating over the keys.
>
> Not that many times I encountered Cassandra crashes and Hector
> exceptions (usually because of GC triggered by Cassandra daemon?) with
> my poor 5-year-old laptop while running Nutch parse command, which is
> very CPU and IO intensive. In mapred-site.xml, see attached config, it
> worked out when you make the read batch reasonable (400 rows at a
> time) and try to separate it from the write batch (for example 843
> written rows per batch) so that they don't happen simultaneously.
>
> Alexis
>
> On Tue, Aug 30, 2011 at 1:24 AM, Alexis <alexis.detregl...@gmail.com> wrote:
>> Hi Tom,
>>
>> Thanks for testing Nutch 2.0 & Cassandra and reporting the obvious
>> bug. I must say there is not very active development and testing on
>> Gora & Nutch, but at least there is some.
>>
>> 1. As regards your ConcurrentModification issue, it looks like it
>> happens when flushing the store. From your exception stacktrace:
>> (Line 192 in org.apache.gora.cassandra.store.CassandraStore)
>>     for (K key: this.buffer.keySet()) {
>>
>> while there are other threads adding new keys to the HashMap:
>>
>> (Line 266)
>>     this.buffer.put(key, p);
>>
>> "it is not generally permissible for one thread to modify a Collection
>> while another thread is iterating over it."
>>
>> Let me try to reproduce the bug and fix it with this in mind:
>> How about introducing some mutex / lock mechanism with
>> java.util.concurrent.locks.Lock or, easier, using a thread-safe
>> implementation such as java.util.concurrent.ConcurrentHashMap?
>>
>> 2. Regarding the OutOfMemory error, maybe decreasing the flushing
>> frequency as described here?
>> http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency
>>
>> I like to use the jvisualvm utility from the JDK that monitors the
>> memory usage and tells you how this evolves during the execution of
>> the class...
>>
>> Alexis
>>
>> On Mon, Aug 29, 2011 at 1:50 PM, Tom Davidson <tdavid...@covario.com> wrote:
>>> Hi Lewis,
>>>
>>> I was running Nutch deployed with a dedicated Cassandra cluster. Frankly, I
>>> have given up on using Nutch 2 at this time as it seems highly unstable and
>>> not really in active development. Your effort to address this is
>>> encouraging. Because Nutch uses multithreading in the fetchers, I was
>>> getting ConcurrentModification errors and OutOfMemory errors on a regular
>>> basis in the CassandraStore. As far as I recall, the caching/flushing
>>> implementation is just not thread safe. If the CassandraStore caching was
>>> completely removed it may work, but would probably not be very efficient.
>>> If I were to fix this class, I would try to rewrite it to use Hector
>>> batched mutations instead.
>>>
>>> Tom
>>>
>>> -----Original Message-----
>>> From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
>>> Sent: Monday, August 29, 2011 1:41 PM
>>> To: gora-dev@incubator.apache.org; d...@nutch.apache.org
>>> Subject: Re: Gora CassandraStore is not thread safe?
>>>
>>> Hi Tom,
>>>
>>> Apologies for cross posting, this would not usually be the case but I'm
>>> hoping that if any results come from the thread then both communities can
>>> benefit.
>>>
>>> I'm in the process of getting Cassandra 0.8.4 working with Nutch 2.0 and
>>> Gora 0.2 myself and seem to be having some nasty problems.
>>>
>>> Some questions for you:
>>>
>>> 1) How are you running Nutch, local or deployed?
>>> 2) How are you running Cassandra, local or deployed in a cluster?
>>>
>>> The obvious thoughts are that this is a bug and that there are
>>> method(s)/object(s) which are not safe.
>>>
>>> Have you gotten any further with this?
>>>
>>> Lewis
>>>
>>> On Wed, Aug 10, 2011 at 8:43 PM, Tom Davidson <tdavid...@covario.com> wrote:
>>>
>>>> Has anyone tested the CassandraStore in gora 0.2 using multiple threads?
>>>> The nutch 2 fetcher architecture has many threads writing to one
>>>> GoraRecordWriter and I am getting concurrent modification errors like
>>>> below.
>>>>
>>>> Caused by: java.util.ConcurrentModificationException
>>>>     at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
>>>>     at java.util.HashMap$KeyIterator.next(HashMap.java:828)
>>>>     at org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:192)
>>>>     at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
>>>>
>>>
>>> --
>>> *Lewis*
>>>
>>
>
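
A minimal sketch of the snapshot idea discussed in this thread, in Java. This is not the code from revision 1177960: the class BufferSketch and its put, flush and pending methods are hypothetical names made up for illustration, and the Cassandra write is replaced by a counter. It assumes the buffer is a java.util.concurrent.ConcurrentHashMap, because a plain HashMap is not safe under concurrent put calls however the keys are read.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical write buffer for illustration only; not the Gora CassandraStore code. */
public class BufferSketch<K, V> {

    // Assumption: a ConcurrentHashMap replaces the plain HashMap, so writers
    // may call put() while flush() is reading the keys.
    private final Map<K, V> buffer = new ConcurrentHashMap<>();

    public void put(K key, V value) {
        buffer.put(key, value);
    }

    public int pending() {
        return buffer.size();
    }

    /** Flushes the entries visible in a key snapshot and returns how many were written. */
    public int flush() {
        // Copy the keys first, so the loop below does not walk the live key set.
        Object[] keys = buffer.keySet().toArray();
        int written = 0;
        for (Object key : keys) {
            // remove() returns null if the entry is already gone.
            V value = buffer.remove(key);
            if (value != null) {
                written++; // a real store would send (key, value) to the backend here
            }
        }
        return written;
    }

    public static void main(String[] args) throws InterruptedException {
        final BufferSketch<Integer, String> store = new BufferSketch<>();
        final int writers = 4;
        final int perWriter = 1000;
        Thread[] threads = new Thread[writers];
        for (int t = 0; t < writers; t++) {
            final int base = t * perWriter;
            threads[t] = new Thread(() -> {
                for (int i = 0; i < perWriter; i++) {
                    store.put(base + i, "row-" + (base + i));
                }
            });
            threads[t].start();
        }
        int total = 0;
        // Flush a few times while the writers are still running.
        for (int round = 0; round < 10; round++) {
            total += store.flush();
        }
        for (Thread thread : threads) {
            thread.join();
        }
        total += store.flush(); // pick up whatever arrived after the last round
        System.out.println("written=" + total + " pending=" + store.pending());
    }
}
```

Because each key is removed as it is flushed, an entry that a writer adds between the snapshot and the end of the loop simply waits for the next flush instead of being lost or written twice.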