Kevin Ratnasekera commented on GORA-401:

[~alfonso.nishikawa] Thanks for the review and comments :). Even though we have
two approaches here, the idea behind them is exactly the same. I have run your
test cases against the fix. Those HBase test cases pass when the fix is
present. I also debugged the test case manually and noticed that the dirty
state of the persistent bean is correctly maintained between the map and reduce
methods of the MR job before writing to the context. That guarantees that the
dirty data bean gets successfully written to the output dataStore when we write
to the context.

I have updated my PR to include the HBase test case you provided here in the
patch. I also think this should work fine with dataStores other than HBase as
well.
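
To illustrate the idea shared by both approaches, here is a minimal sketch. It
is not the actual patch and does not use Gora's API: the Employee bean, the
DirtyStateSketch class and all method names are hypothetical, and plain
java.io streams stand in for the Avro-based serializer. It only shows that
rebuilding a bean from its field values marks every field dirty, unless the
dirty bits are written and restored as well.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

/** Hypothetical stand-in for a persistent bean and its serializer; not Gora code. */
final class DirtyStateSketch {

  /** Two-field bean that tracks which fields were modified. */
  static final class Employee {
    String name = "";
    int salary;
    final BitSet dirty = new BitSet(2); // bit 0 = name, bit 1 = salary

    void setName(String v) { name = v; dirty.set(0); }
    void setSalary(int v) { salary = v; dirty.set(1); }
  }

  /** Writes the field values and, optionally, the dirty bits after them. */
  static byte[] serialize(Employee e, boolean withDirtyBits) throws IOException {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buf)) {
      out.writeUTF(e.name);
      out.writeInt(e.salary);
      if (withDirtyBits) {
        out.writeBoolean(e.dirty.get(0));
        out.writeBoolean(e.dirty.get(1));
      }
    }
    return buf.toByteArray();
  }

  /** Rebuilds the bean; the setters mark every field dirty as a side effect. */
  static Employee deserialize(byte[] bytes, boolean withDirtyBits) throws IOException {
    Employee e = new Employee();
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
      e.setName(in.readUTF());
      e.setSalary(in.readInt());
      if (withDirtyBits) {
        // Overwrite the "everything dirty" state with the state that was sent.
        e.dirty.set(0, in.readBoolean());
        e.dirty.set(1, in.readBoolean());
      }
    }
    return e;
  }

  /** Round-trips a bean whose only modified field is salary; returns its dirty bits. */
  static BitSet roundTrip(boolean withDirtyBits) {
    try {
      Employee e = new Employee();
      e.setSalary(1000);
      return deserialize(serialize(e, withDirtyBits), withDirtyBits).dirty;
    } catch (IOException ex) {
      throw new UncheckedIOException(ex);
    }
  }

  public static void main(String[] args) {
    System.out.println("fields only:      " + roundTrip(false));
    System.out.println("with dirty state: " + roundTrip(true));
  }
}
```

In the first round trip both fields come back dirty even though only salary
was modified, which is the situation that leads to all data being overwritten
in the datastore. In the second, only the salary bit is set after
deserialization.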


> Serialization and deserialization of Persistent does not hold the entity 
> dirty state from Map to Reduce
> -------------------------------------------------------------------------------------------------------
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on 
> gora-0.5. HBase backend.
>            Reporter: Alfonso Nishikawa
>            Assignee: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>             Fix For: 0.8
>         Attachments: GORA-401-tests.patch, GORA-401v1.patch, 
> GORA-401v2.patch, GORA-401v3.patch, GORA-401v4.patch, GORA-401v5.patch
>   Original Estimate: 35h
>          Time Spent: 21h
>  Remaining Estimate: 14h
> After the removal of the __g__dirty field in GORA-326, the dirty field is no longer serialized. 
> In GORA-321 
> {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
>  went from using 
> {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
>  to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty 
> field to Avro (although it is not really desirable to have that field as a 
> main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which 
> will serialize the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its 
> phases, serializes entities (from Map to Reduce), and on deserialization 
> finds all fields marked as "dirty", regardless of which fields were modified 
> in the Map, and so overwrites all data in the datastore (deleting many 
> things: downloaded content, parsed content, etc).
> This effect can be seen in 
> {{TestPersistentSerialization#testSerderEmployeeTwoFields}}, when debugging in 
> {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections 
> show that entities are "equal" when their fields are equal. This is fine as a 
> definition of "equal", but another test must be added to check that 
> serialization and deserialization keep the dirty state.

This message was sent by Atlassian JIRA
