Vincent,

Thanks for the patch. First off, feel free to go ahead and attach it to the JIRA issue directly [1]. We typically review patches that have been attached there.

To answer some of your questions:

>> HBaseSourceTarget implements TableSource<..., ...>, but GoraSourceTarget implements Source<Pair<K, V>>, Gora DataStore is a map and not a multimap. Should it be a TableSource anyway ?

I'm not familiar with Gora: do consumers typically interact with the data in a K/V manner? While PTables can be multimaps, they don't have to be. Making the data available as a PTable would make sense if consumers typically need to do joins or grouping on a key that is meaningful to Gora. As an example, in HBase a consumer might set a batching value that breaks up a single row. Making grouping easier allows the consumer to recombine the row for processing.

>> GoraSourceIT test failure

When using MRPipeline, the source is actually serialized and instantiated on the cluster, as opposed to the instance you created in memory, which is what MemPipeline uses. If you need values like the start and end key (which I see in your GoraSourceTarget) to be available when running on MRPipeline, then they will need to be properly configured. Look at how the HBase impls make scans available.

Also, when quickly glancing through this guide I saw references to init calls:
http://gora.apache.org/current/tutorial.html#constructing-the-job
While you don't necessarily have to call those mappers, you'll probably want to make sure any config they are doing is handled in the Source/Target setup.

>> - Should there be an equivalent to HBaseTypes.puts and HBaseTypes.deletes with Gora?

Once again, I'm not familiar with Gora, but the need for those is predicated on how consumers would typically interact with the system. Those representations make it easy for consumers to perform standard operations on HBase without having to worry about serializing the HBase Put into the correct byte[] for the service to know what to do.
It doesn't look like Gora necessarily has straight Puts/Deletes like HBase does:
https://hbase.apache.org/0.94/apidocs/org/apache/hadoop/hbase/client/Put.html
So the question is: how does one represent an insertion or deletion in the Gora input format or output format?

>> Crunch & Eclipse warning:

You can ignore that lifecycle warning if you are building everything through Maven and not Eclipse. I believe it shows up only because Eclipse does not handle the sources being generated for the tests when it is trying to control that project.

>> More generally, what about code quality? (still junior...)

I haven't gotten a chance to do a deep review of your code yet, but don't worry about that; we can help with it.

Thanks,
Micah

[1] - https://issues.apache.org/jira/browse/CRUNCH-184

On Mon, May 25, 2015 at 3:06 AM, Vincent Fabro <vincent.fabro.nu...@gmail.com> wrote:
> Dear all
>
> A patch for a crude Gora backend implementation is attached. I copy-pasted
> the HBase implementation and made modifications.
>
> I have questions to push it further:
>
> - HBaseSourceTarget implements TableSource<..., ...>, but
> GoraSourceTarget implements Source<Pair<K, V>>, Gora DataStore is a map
> and not a multimap. Should it be a TableSource anyway ?
>
> - I made simple examples in GoraSourceIT (will be removed, no proper tests
> yet). You can read/write to a GoraSourceTarget when using MemPipeline, but
> MRPipeline gives the following error when reading from a Gora MemStore
> (GoraSourceIT.testGoraTarget()):
> 1035 [main] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2205 [Thread-2] WARN org.apache.hadoop.mapreduce.JobSubmitter - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
> 2207 [Thread-2] WARN org.apache.hadoop.mapreduce.JobSubmitter - No job jar file set.
> User classes may not be found. See Job or Job#setJar(String).
> 2925 [Thread-2] INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob - Running job "org.apache.crunch.io.gora.GoraSourceIT: GoraDataStore(org.apache.gora.memory.store.MemStore@2b3b2... ID=1 (1/1)"
> 2925 [Thread-2] INFO org.apache.crunch.hadoop.mapreduce.lib.jobcontrol.CrunchControlledJob - Job status available at: http://localhost:8080/
> java.util.NoSuchElementException
>     at java.util.TreeMap.key(TreeMap.java:1221)
>     at java.util.TreeMap.firstKey(TreeMap.java:285)
>     at org.apache.gora.memory.store.MemStore.execute(MemStore.java:125)
>     at org.apache.gora.query.impl.QueryBase.execute(QueryBase.java:73)
>     at org.apache.gora.mapreduce.GoraRecordReader.executeQuery(GoraRecordReader.java:68)
>     at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:110)
>     at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:532)
>     at org.apache.hadoop.mapreduce.task.MapContextImpl.nextKeyValue(MapContextImpl.java:80)
>     at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.nextKeyValue(WrappedMapper.java:91)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:763)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:235)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>     at java.lang.Thread.run(Thread.java:745)
>
> - Should there be an equivalent to HBaseTypes.puts and HBaseTypes.deletes
> with Gora?
>
> - When Crunch was imported to Eclipse, the following problem appeared in
> crunch-hbase/pom.xml:
> Plugin execution not covered by lifecycle configuration:
> org.apache.maven.plugins:maven-dependency-plugin:2.8:build-classpath
> (execution: create-mrapp-generated-classpath, phase: generate-test-resources)
> What could be the reason (for the moment I let Eclipse automatically fix
> the problem) ?
>
> - More generally, what about code quality? (still junior...)
>
> I don't know if it's headed in the right place, so thanks in advance for
> your directions.
>
> Vincent
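
To illustrate the MRPipeline point from the reply, here is a minimal sketch of the idea, not the actual Crunch or Gora API. It assumes the start and end keys can be written as strings, and it uses a plain Map as a stand-in for the Hadoop job configuration. The class name and the property names are hypothetical. The last method shows that calling firstKey() on an empty TreeMap throws NoSuchElementException, which is consistent with (though not proof of) the task side seeing a fresh, empty MemStore rather than the populated one created in the test.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NoSuchElementException;
import java.util.TreeMap;

final class SourceConfigSketch {
    // Hypothetical property names for this sketch; not real Gora or Crunch keys.
    static final String START_KEY = "sketch.gora.query.startkey";
    static final String END_KEY = "sketch.gora.query.endkey";

    // Client side: copy the query bounds into the job configuration, since
    // values held only by the in-memory source instance are not assumed to
    // reach the task.
    static void configureSource(Map<String, String> conf, String startKey, String endKey) {
        conf.put(START_KEY, startKey);
        conf.put(END_KEY, endKey);
    }

    // Task side: rebuild the bounds from the configuration alone.
    static String[] readBounds(Map<String, String> conf) {
        return new String[] { conf.get(START_KEY), conf.get(END_KEY) };
    }

    // A freshly created, empty sorted map fails on firstKey(), matching the
    // top frames of the quoted stack trace.
    static boolean emptyStoreFails() {
        TreeMap<String, String> freshStore = new TreeMap<>();
        try {
            freshStore.firstKey();
            return false;
        } catch (NoSuchElementException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        configureSource(conf, "row-000", "row-999");
        String[] bounds = readBounds(conf);
        System.out.println(bounds[0] + " .. " + bounds[1]);
        System.out.println("empty store fails: " + emptyStoreFails());
    }
}
```

In a real Source the same round trip would go through the job's Configuration when the source is configured, much as the HBase impls do for scans. An in-memory store would additionally need its data to be reachable from the task, which this sketch does not address.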