Hi all,

First off, sorry for cross-posting, but I think this announcement is relevant to the Hadoop, HBase and Avro communities alike.

We would like to introduce a new project called Gora. Gora is a Java ORM layer for column stores, SQL databases, key-value stores and document databases. The design goal is to have a common API to access and manage multiple data stores. Gora differs from other Java ORM frameworks in that special focus is given to column-oriented databases such as Apache HBase and Apache Cassandra.

Gora is in no way a replacement for Hibernate, DataNucleus or [insert the name of your favorite ORM project], but we think it differs from traditional ORM frameworks in the following respects:

- Gora is specifically designed with NoSQL data stores in mind. For example, the API is based on <key, value> pairs rather than just beans. We also believe that the ORM layer should be tuned for batch operations (for example, with first-class support for object re-use).
- Gora uses Avro to generate data beans from Avro schemas. Moreover, most of the serialization is delegated to Avro. For example, a map is serialized to a field (unless configured otherwise) using Avro serialization.
- Gora provides first-class support for Hadoop MapReduce. DataStore implementations are responsible for partitioning the data (the partitions are then converted to Hadoop splits), and all the locality information is likewise obtained from the data store. Developing MapReduce jobs with Gora is really easy.
- The long-term goal for Gora is to be an intermediate data format for popular big-data and search-related projects. In the medium term, we plan to support Cassandra, Cascading, Pig and Solr. Think of the possibilities when you can use the same data structures to persist objects to HBase, SQL and Solr, and use Pig or Cascading in jobs to mine the data stored in HBase/Cassandra/SQL/etc.

Gora works as follows. You define the data structures for your domain using regular Avro JSON schemas. Then, instead of compiling the Avro files with Avro's compiler, you compile them with GoraCompiler.
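
To give a rough feel for the <key, value> style API and the object re-use idea mentioned above, here is a minimal, self-contained sketch. Every name in it (Store, MemStore, WebPage) is hypothetical and made up for illustration; it is not Gora's actual API, and the in-memory map only stands in for a real back-end such as HBase or SQL.

```java
import java.util.Map;
import java.util.TreeMap;

public class KeyValueStoreSketch {

    /** Hypothetical bean; in Gora a class like this would be generated from an Avro schema. */
    static class WebPage {
        String url;
        String title;
    }

    /** Hypothetical common interface that each back-end would implement. */
    interface Store<K, V> {
        void put(K key, V value);

        /** Fills 'reuse' when it is non-null, so batch jobs can avoid one allocation per record. */
        V get(K key, V reuse);
    }

    /** In-memory stand-in for a real back-end (assumption: not how Gora stores data). */
    static class MemStore implements Store<String, WebPage> {
        private final Map<String, String[]> rows = new TreeMap<>();

        @Override
        public void put(String key, WebPage value) {
            // Copy the fields out so later changes to 'value' do not affect the stored row.
            rows.put(key, new String[] { value.url, value.title });
        }

        @Override
        public WebPage get(String key, WebPage reuse) {
            String[] row = rows.get(key);
            if (row == null) {
                return null;
            }
            WebPage page = (reuse != null) ? reuse : new WebPage();
            page.url = row[0];
            page.title = row[1];
            return page;
        }
    }

    public static void main(String[] args) {
        Store<String, WebPage> store = new MemStore();

        WebPage page = new WebPage();
        page.url = "http://nutch.apache.org/";
        page.title = "Apache Nutch";
        store.put("org.apache.nutch:http/", page);

        // Read back into a pre-allocated object instead of creating a new one.
        WebPage scratch = new WebPage();
        WebPage loaded = store.get("org.apache.nutch:http/", scratch);
        System.out.println(loaded.title + " (reused: " + (loaded == scratch) + ")");
    }
}
```

The point of the sketch is only the shape of the interface: the caller works with keys and values, and swapping MemStore for another implementation would not change the calling code.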
The generated beans keep track of the persistency information along with the data. Then, for each data back-end, you define a mapping file that maps class fields to the data-store-specific schema. For example, HBase mapping files define the column families and the mappings from fields to columns or column families, whereas SQL mapping files define mappings for table fields.

Gora started in NutchBase (http://github.com/dogacan/nutchbase), a branch of Apache Nutch (http://nutch.apache.org/) that is being used as the basis for what will become Nutch 2.0. For the second version of the popular open source web search project, an abstraction layer was needed so that the core data structures of Nutch would no longer be kept as flat files on Hadoop. We wanted to be able to use popular NoSQL databases (HBase, Cassandra, Hypertable, etc.), optionally flat files, and SQL databases (especially embedded zero-conf SQL databases). So Gora was born as a project.

Gora is now in a pre-alpha stage, with a public release planned before the end of the year. Documentation is also very sparse at this point. However, the code is already used in NutchBase and will be used in Nutch 2.0. We currently support HBase, plain Avro data files and SQL. Cassandra support is coming soon.

Of course, the current set of developers is very small, and we need your help in achieving these goals, so feel free to contribute in any way you see fit. We believe in the Apache way of development; in fact, one of the possible paths for Gora is to be accepted as a subproject of Incubator or Hadoop (we welcome any feedback on this).

Lastly, you can find the project at http://github.com/enis/gora/. Some example code is at http://github.com/enis/gora/tree/master/gora-core/src/examples/ and http://github.com/dogacan/nutchbase. Feel free to use this list or [email protected] for further discussion.

Thanks,
Enis Söztutar

tl;dr Gora is an ORM layer with a specific focus on NoSQL data stores. It has HBase, SQL, Avro and MapReduce support.
