Enis Soztutar commented on NUTCH-808:

So, these are the results so far: 

DataNucleus was previously known as JPOX, and it was the reference 
implementation for Java Data Objects (JDO). JDO is a Java standard for 
persistence. A similar specification, JPA, is also a persistence standard, 
and was derived from EJB 3. However, JPA is designed for RDBMSs only, so it 
will not be useful for us. 

In JDO, the first step is to define the domain objects as POJOs. Then the 
persistence metadata is specified using annotations, XML, or both. Next, a 
bytecode enhancer uses instrumentation to add the required methods to the 
classes marked as @PersistenceCapable. The database tables can be created by 
hand, automatically by DataNucleus, or with a tool (SchemaTool). 
The persistence layer uses standard JDO syntax, which is similar to JDBC. The 
objects can be queried using JPQL. 

I have run a small test to persist objects of the WebTableRow class (from the 
NutchBase branch) to both MySQL and HBase. Although it took me a fair bit of 
time to set up, I was able to persist objects to both. 

However, although it is possible to map complex fields (lists, maps, arrays, 
etc.) to an RDBMS using different strategies (such as serializing directly, 
using joins, or using foreign keys), I was not able to find a way to leverage 
the HBase data model. For example, we want to be able to map lists and maps to 
columns in column families. Without such functionality, using column-oriented 
stores does not bring any advantage. 
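
To make the desired mapping concrete, here is a minimal sketch in plain Java 
of flattening a map-valued field into one cell per entry, keyed as 
"family:qualifier". The class and method names are hypothetical and are not 
part of DataNucleus, HBase, or NutchBase; a real datastore plugin would write 
these cells through the HBase client rather than into an in-memory map. 

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: none of these names come from DataNucleus, HBase, or NutchBase.
class ColumnFamilyMappingSketch {

    // Flattens a map-valued field into one cell per entry, keyed "family:qualifier".
    static Map<String, byte[]> toCells(String family, Map<String, String> field) {
        Map<String, byte[]> cells = new TreeMap<>();
        for (Map.Entry<String, String> e : field.entrySet()) {
            cells.put(family + ":" + e.getKey(),
                      e.getValue().getBytes(StandardCharsets.UTF_8));
        }
        return cells;
    }

    // Rebuilds the map field from the cells of one family, skipping other families.
    static Map<String, String> fromCells(String family, Map<String, byte[]> cells) {
        String prefix = family + ":";
        Map<String, String> field = new TreeMap<>();
        for (Map.Entry<String, byte[]> e : cells.entrySet()) {
            if (e.getKey().startsWith(prefix)) {
                field.put(e.getKey().substring(prefix.length()),
                          new String(e.getValue(), StandardCharsets.UTF_8));
            }
        }
        return field;
    }

    public static void main(String[] args) {
        // Assumed example data: outlinks of a page, URL -> anchor text.
        Map<String, String> outlinks = new TreeMap<>();
        outlinks.put("http://a.example/", "anchor A");
        outlinks.put("http://b.example/", "anchor B");
        System.out.println(toCells("ol", outlinks).keySet());
    }
}
```

With this layout each map entry can be read or updated as a single column, 
instead of deserializing the whole field from one opaque blob. 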

For the byte[] serialization needed by MapReduce, we can implement a new 
datastore for DataNucleus that also implements Hadoop's Serialization, use 
Avro to generate Java classes to be fed into the JPOX enhancer, or manually 
implement Writable. 
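
As a rough illustration of the last option, the sketch below hand-writes the 
two methods that Hadoop's Writable interface requires, write(DataOutput) and 
readFields(DataInput). The row type and its fields are hypothetical, and the 
class deliberately does not declare "implements org.apache.hadoop.io.Writable" 
so that it compiles with only the JDK. 

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical row type. In a Hadoop job it would also declare
// "implements org.apache.hadoop.io.Writable"; that is omitted here (assumption)
// to keep the sketch free of external dependencies.
class RowWritableSketch {
    String url = "";
    long fetchTime;
    byte[] content = new byte[0];

    // Same signature as Writable.write: serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeUTF(url);
        out.writeLong(fetchTime);
        out.writeInt(content.length); // length prefix for the byte array
        out.write(content);
    }

    // Same signature as Writable.readFields: read fields back in the same order.
    public void readFields(DataInput in) throws IOException {
        url = in.readUTF();
        fetchTime = in.readLong();
        content = new byte[in.readInt()];
        in.readFully(content);
    }

    // Helper: serialize a row to a byte[] using an in-memory stream.
    static byte[] toBytes(RowWritableSketch row) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            row.write(new DataOutputStream(buf));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Helper: rebuild a row from the bytes produced by toBytes.
    static RowWritableSketch fromBytes(byte[] bytes) {
        try {
            RowWritableSketch row = new RowWritableSketch();
            row.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
            return row;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

The cost of this approach is that every persistent class needs its own 
hand-maintained pair of methods, which is what generating the classes with 
Avro would avoid. 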

To sum up, DataNucleus brings the following advantages:
- out-of-the-box RDBMS support 
- XML or annotation metadata
- JDO is a Java standard 
- standard query interface
- JSON support

The disadvantages of using DataNucleus would be:
- JDO is rather complex, and implementing a datastore is not trivial
- We need to write patches to DataNucleus to flexibly map complex fields and 
leverage HBase's data model
- We have no control over the source code
- no native HBase support (for example, using filters, etc.)

On the other hand, the current implementation 
- is tested in production, 
- can leverage the HBase data model, 
- can be modified to work with Avro serialization directly, 
- could gain Cassandra support with little effort, 
- can support multiple languages (in the future).

I believe that having SQLite, MySQL and HBase support is critical for Nutch 
2.0, for out-of-the-box use, ease of deployment, and real-scale computing, 
respectively. But obviously we cannot use DataNucleus out of the box either. 

ORM is inherently a hard problem. I propose we go ahead and make the changes to 
DataNucleus to see if it is feasible, and continue with it if it suits our 
needs. Of course, having a custom framework will also be great, so any feedback 
would be more than welcome. 

> Evaluate ORM Frameworks which support non-relational column-oriented 
> datastores and RDBMs 
> ------------------------------------------------------------------------------------------
>                 Key: NUTCH-808
>                 URL: https://issues.apache.org/jira/browse/NUTCH-808
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0
> We have an ORM layer in the NutchBase branch, which uses Avro Specific 
> Compiler to compile class definitions given in JSON. Before moving on with 
> this, we might benefit from evaluating other frameworks, whether they suit 
> our needs. 
> We want at least the following capabilities:
> - Using POJOs 
> - Able to persist objects to at least HBase, Cassandra, and RDBMs 
> - Able to efficiently serialize objects as task outputs from Hadoop jobs
> - Allow native queries, along with standard queries 
> Any comments, suggestions for other frameworks are welcome.


