Hi, I submitted the patch for peer review by just attaching it to the issue: https://issues.apache.org/jira/browse/GORA-22
For background on concurrency and hash maps, see this article:
http://www.ibm.com/developerworks/java/library/j-jtp07233/index.html

I ended up calling toArray over the key set to get around the ConcurrentModificationException that java.util.HashMap throws by default when the map is modified while its keys are being iterated.

I only hit Cassandra crashes and Hector exceptions a few times (usually because of GC triggered by the Cassandra daemon?) on my poor 5-year-old laptop while running the Nutch parse command, which is very CPU and IO intensive. What worked for me, in mapred-site.xml (see attached config), was keeping the read batch reasonable (400 rows at a time) and separating it from the write batch (for example 843 rows written per batch) so that the two don't happen simultaneously.

Alexis

On Tue, Aug 30, 2011 at 1:24 AM, Alexis <alexis.detregl...@gmail.com> wrote:
> Hi Tom,
>
> Thanks for testing Nutch 2.0 & Cassandra and reporting the obvious
> bug. I must say there is not a very active development and testing on
> Gora & Nutch, but at least there is some.
>
>
> 1. As regards your ConcurrentModification issue, it looks like it
> happens when flushing the store. From your exception stacktrace:
> (Line 192 in org.apache.gora.cassandra.store.CassandraStore)
> for (K key: this.buffer.keySet()) {
>
> while there are other threads adding new keys to the HashMap:
>
> (Line 266)
> this.buffer.put(key, p);
>
> "it is not generally permissible for one thread to modify a Collection
> while another thread is iterating over it."
>
> Let me try to reproduce the bug and fix it with this in mind:
> How about introducing some mutex / lock mechanism with
> java.util.concurrent.locks.Lock or easier, using a thread-safe
> implementation such as java.util.concurrent.ConcurrentHashMap?
>
>
> 2. Regarding the OutOfMemory error, maybe decreasing the flushing
> frequency as described here?
> http://techvineyard.blogspot.com/2011/02/gora-orm-framework-for-hadoop-jobs.html#I_O_Frequency
>
> I like to use the jvisualvm utility from the JDK that monitors the
> memory usage and tells you how this evolves during the execution of
> the class...
>
> Alexis
>
> On Mon, Aug 29, 2011 at 1:50 PM, Tom Davidson <tdavid...@covario.com> wrote:
>> Hi Lewis,
>>
>> I was running Nutch deployed with a dedicated Cassandra cluster. Frankly, I
>> have given up on using Nutch 2 at this time as it seems highly unstable and
>> not really in active development. Your effort to address this is
>> encouraging. Because Nutch uses multithreading in the fetchers, I was
>> getting ConcurrentModification errors and OutOfMemory errors on a regular
>> basis in the CassandraStore. As far as I recall, the caching/flushing
>> implementation is just not thread safe. If the CassandraStore caching was
>> completely removed it may work, but would probably not be very efficient.
>> If I were to fix this class, I would try to rewrite it to use Hector batched
>> mutations instead.
>>
>> Tom
>>
>> -----Original Message-----
>> From: lewis john mcgibbney [mailto:lewis.mcgibb...@gmail.com]
>> Sent: Monday, August 29, 2011 1:41 PM
>> To: gora-dev@incubator.apache.org; d...@nutch.apache.org
>> Subject: Re: Gora CassandraStore is not thread safe?
>>
>> Hi Tom,
>>
>> Apologies for cross posting, this would not usually be the case but I'm
>> hoping that if any results come from the thread then both communities can
>> benefit.
>>
>> I'm in the process of getting Cassandra 0.8.4 working with Nutch 2.0 and
>> Gora 0.2 myself and seem to be having some nasty problems.
>>
>> Some questions for you
>>
>> 1) How are you running Nutch local or deploy?
>> 2) How are you running Cassandra, local or deployed in a cluster?
>>
>> The obvious thoughts are that this is a bug and that there are
>> method(s)/object(s) which are not safe.
>>
>> Have you gotten any further with this?
>>
>> Lewis
>>
>>
>> On Wed, Aug 10, 2011 at 8:43 PM, Tom Davidson <tdavid...@covario.com> wrote:
>>
>>> Has anyone tested the CassandraStore in gora 0.2 using multiple threads?
>>> The nutch 2 fetcher architecture has many threads writing to one
>>> GoraRecordWriter and I am getting concurrent modification errors like below.
>>>
>>> Caused by: java.util.ConcurrentModificationException
>>> at java.util.HashMap$HashIterator.nextEntry(HashMap.java:793)
>>> at java.util.HashMap$KeyIterator.next(HashMap.java:828)
>>> at org.apache.gora.cassandra.store.CassandraStore.flush(CassandraStore.java:192)
>>> at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:65)
>>>
>>
>>
>> --
>> *Lewis*
>>
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>gora.buffer.read.limit</name>
    <value>400</value>
  </property>
  <property>
    <name>gora.buffer.write.limit</name>
    <value>843</value>
  </property>
</configuration>
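
To illustrate the toArray workaround and the ConcurrentHashMap alternative discussed in this thread, here is a minimal sketch. It is not the CassandraStore code or the attached patch: the class BufferFlushSketch, its methods, and the "late" key are all made up for the example, and the "writer" is simulated in the same thread so that the behaviour is reproducible.

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical illustration only; not taken from Gora or from the patch.
class BufferFlushSketch {

    // Iterates the live key set while a new key is added mid-iteration.
    // Returns true if the loop died with ConcurrentModificationException.
    static boolean liveIterationFails(Map<String, String> buffer) {
        try {
            for (String key : buffer.keySet()) {
                // Simulates another writer adding a row during the flush.
                buffer.put("late", "added during flush");
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Copies the keys with toArray first, then iterates the copy.
    // Keys added after the copy are simply left for the next flush.
    static List<String> flushWithSnapshot(Map<String, String> buffer) {
        String[] keys = buffer.keySet().toArray(new String[0]);
        List<String> flushed = new ArrayList<String>();
        for (String key : keys) {
            buffer.put("late", "added during flush");
            flushed.add(key);
        }
        return flushed;
    }

    // Fills a map with three sample rows (assumed data for the demo).
    static Map<String, String> fill(Map<String, String> buffer) {
        buffer.put("row1", "a");
        buffer.put("row2", "b");
        buffer.put("row3", "c");
        return buffer;
    }

    public static void main(String[] args) {
        System.out.println("HashMap, live key set fails: "
                + liveIterationFails(fill(new HashMap<String, String>())));
        System.out.println("HashMap, keys flushed from snapshot: "
                + flushWithSnapshot(fill(new HashMap<String, String>())).size());
        System.out.println("ConcurrentHashMap, live key set fails: "
                + liveIterationFails(fill(new ConcurrentHashMap<String, String>())));
    }
}
```

Note that the snapshot only removes the fail-fast iterator from the picture. A plain java.util.HashMap is still not safe when other threads really write to it at the same time, so a lock or a ConcurrentHashMap (whose iterators are weakly consistent and never throw ConcurrentModificationException) is probably still the more robust choice.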