[ 
https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14285963#comment-14285963
 ] 

Alfonso Nishikawa commented on GORA-401:
----------------------------------------

Hi all, and hi [~renato2099] (answering).
I unassigned myself from the issue because it is a VERY complex issue and help 
will be needed for the datastores other than HBase. I uploaded a Q&D (Quick & 
Dirty) patch that surely works for HBase, while the other datastores must be 
checked (I don't have time at this moment to run a complete mvn test; I have 
to leave in a few minutes). An elegant, well-engineered solution seems a bit 
hard, so what I did is revert part of GORA-321 (plus some additional work). I 
will detail the changes below.

Before that, just a note: _I think_ I managed to get the dirty bytes 
serialized in MapReduce and not serialized when persisting.

Why is it Quick and Dirty?
- There is a need for a `FakeResolvingDecoder` (reverted), but now Avro's 
`ResolvingDecoder` has a package-private constructor (at the time of Gora 0.3 
it was public). So I had to put FakeResolvingDecoder in that package (dirty, 
dirty!) and export it in OSGi. *Please, someone check what I wrote in 
osgi.export in `gora-core/pom.xml`*. I don't know anything about OSGi and I 
wrote it as best I could.
- Recreated MockPersistent, which was outdated and for which no .json existed.
- Reverted a plethora of classes.
- Modified `PersistentDatumReader#readRecord` to return `Object` because it 
sometimes returns a record and sometimes returns other things (specifically in 
unions). This was not happening in Gora 0.3, or at least was not detected. A 
big pain to debug and fix.
- HBaseByteInterface#toBytes() / fromBytes() use SpecificDatumWriter/Reader, 
so no dirty bytes are serialized/deserialized when writing to the datastore.
- Public `getDirtyBytes()` and `setDirtyBytes()` are needed in PersistentBase 
to get and restore the dirty bytes when serializing.
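
To illustrate the idea behind the last two points, here is a minimal sketch of 
two serialization paths: one that carries the dirty bytes (Map to Reduce) and 
one that writes fields only (persisting). It is not the code from the patch 
and does not use Avro or Gora classes. `SketchRecord`, `DirtyStateSketch` and 
the wire layout (length-prefixed dirty bytes followed by the fields) are 
assumptions made up for the example; only the accessor names 
`getDirtyBytes()` / `setDirtyBytes()` come from the list above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.Arrays;

// Hypothetical stand-in for a generated bean; NOT Gora's PersistentBase.
class SketchRecord {
    String name = "";                         // field 0
    int salary;                               // field 1
    private byte[] dirtyBytes = new byte[1];  // one bit per field

    void setName(String v) { name = v; dirtyBytes[0] |= 1; }
    void setSalary(int v) { salary = v; dirtyBytes[0] |= 2; }

    byte[] getDirtyBytes() { return dirtyBytes.clone(); }
    void setDirtyBytes(byte[] b) { dirtyBytes = b.clone(); }
    void clearDirty() { Arrays.fill(dirtyBytes, (byte) 0); }

    boolean isDirty(int fieldIndex) {
        return (dirtyBytes[fieldIndex / 8] & (1 << (fieldIndex % 8))) != 0;
    }
}

class DirtyStateSketch {
    // Map -> Reduce path: dirty bytes travel in front of the fields.
    static byte[] toMapReduceBytes(SketchRecord r) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        byte[] dirty = r.getDirtyBytes();
        out.writeInt(dirty.length);
        out.write(dirty);
        writeFields(r, out);
        out.flush();
        return bos.toByteArray();
    }

    static SketchRecord fromMapReduceBytes(byte[] data) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        byte[] dirty = new byte[in.readInt()];
        in.readFully(dirty);
        SketchRecord r = readFields(in);
        // The setters used while reading flagged every field as dirty;
        // restore the state that was captured before serialization.
        r.setDirtyBytes(dirty);
        return r;
    }

    // Persistence path: fields only, no dirty bytes reach the datastore.
    static byte[] toStoreBytes(SketchRecord r) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        writeFields(r, out);
        out.flush();
        return bos.toByteArray();
    }

    private static void writeFields(SketchRecord r, DataOutputStream out)
            throws IOException {
        out.writeUTF(r.name);
        out.writeInt(r.salary);
    }

    private static SketchRecord readFields(DataInputStream in) throws IOException {
        SketchRecord r = new SketchRecord();
        r.setName(in.readUTF());
        r.setSalary(in.readInt());
        return r;
    }

    public static void main(String[] args) throws IOException {
        SketchRecord r = new SketchRecord();
        r.setName("alice");
        r.setSalary(1000);
        r.clearDirty();     // pretend the record was just loaded from the store
        r.setSalary(2000);  // only field 1 is modified in the "Map"

        SketchRecord back = fromMapReduceBytes(toMapReduceBytes(r));
        System.out.println("name dirty:   " + back.isDirty(0));
        System.out.println("salary dirty: " + back.isDirty(1));
    }
}
```

In this sketch the bytes produced by `toStoreBytes` never contain the dirty 
state, which mirrors the intent of keeping it out of the datastore while still 
shipping it between Map and Reduce.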

Maybe some tests that check the dirty state will have to be improved.
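
As a rough idea of what a stricter check could look like, the sketch below 
compares the dirty state in addition to the field values. It is an assumption 
about the shape of such a test, not an existing Gora test: `TinyEmployee`, 
`DirtyStateCheckSketch` and `naiveRoundTrip` (a stand-in for a 
serialize/deserialize cycle that drops the dirty state) are hypothetical names.

```java
import java.util.BitSet;
import java.util.Objects;

// Hypothetical bean whose equals() looks at field values only.
class TinyEmployee {
    private String name = "";
    private int salary;
    private final BitSet dirty = new BitSet(2);

    void setName(String v) { name = v; dirty.set(0); }
    void setSalary(int v) { salary = v; dirty.set(1); }
    String getName() { return name; }
    int getSalary() { return salary; }

    void clearDirty() { dirty.clear(); }
    BitSet dirtyState() { return (BitSet) dirty.clone(); }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TinyEmployee)) return false;
        TinyEmployee other = (TinyEmployee) o;
        return salary == other.salary && name.equals(other.name);
    }

    @Override
    public int hashCode() { return Objects.hash(name, salary); }
}

class DirtyStateCheckSketch {
    // Stand-in for a round trip that rebuilds the bean through its setters
    // and therefore marks every field as dirty.
    static TinyEmployee naiveRoundTrip(TinyEmployee in) {
        TinyEmployee out = new TinyEmployee();
        out.setName(in.getName());
        out.setSalary(in.getSalary());
        return out;
    }

    static boolean sameFields(TinyEmployee a, TinyEmployee b) {
        return a.equals(b);
    }

    static boolean sameDirtyState(TinyEmployee a, TinyEmployee b) {
        return a.dirtyState().equals(b.dirtyState());
    }

    public static void main(String[] args) {
        TinyEmployee e = new TinyEmployee();
        e.setName("alice");
        e.setSalary(1000);
        e.clearDirty();     // as if freshly loaded
        e.setSalary(2000);  // only one field modified

        TinyEmployee copy = naiveRoundTrip(e);
        // A field-only comparison passes even though the dirty state changed.
        System.out.println("same fields:      " + sameFields(e, copy));
        System.out.println("same dirty state: " + sameDirtyState(e, copy));
    }
}
```

A test that asserts only on `equals()` would pass here, so an additional 
assertion on the dirty state is what would catch the regression.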

(And the code in PersistentDatumWriter must be improved; please don't look at 
it. I am embarrassed, and I had no time to fix it.)

Comments? Help with the rest of the datastores?

Thanks!

> Serialization and deserialization of Persistent does not hold the entity 
> dirty state from Map to Reduce
> -------------------------------------------------------------------------------------------------------
>
>                 Key: GORA-401
>                 URL: https://issues.apache.org/jira/browse/GORA-401
>             Project: Apache Gora
>          Issue Type: Bug
>          Components: gora-core
>    Affects Versions: 0.4, 0.5
>         Environment: Tested on gora-0.4, but seems logically to hold on 
> gora-0.5. HBase backend.
>            Reporter: Alfonso Nishikawa
>            Priority: Critical
>              Labels: serialization
>         Attachments: GORA-401-tests.patch, GORA-401v1.patch
>
>   Original Estimate: 35h
>          Time Spent: 21h
>  Remaining Estimate: 14h
>
> After removing the __g__dirty field in GORA-326, the dirty field is not serialized. 
> In GORA-321 
> {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
>  went from using 
> {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
>  to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty 
> field to Avro (although it is really not desirable to have that field as a 
> main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which 
> will serialize the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its 
> phases, serializes entities (from Map to Reduce), and on deserialization 
> finds all fields marked as "dirty", independently of which fields were 
> modified in the Map, and overwrites all data in the datastore (deleting many 
> things: downloaded content, parsed content, etc.).
> This effect can be seen in 
> {{TestPersistentSerialization#testSerderEmployeeTwoFields}}, when debugging in 
> {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections 
> show that entities are "equal" when their fields are equal. This is fine as a 
> definition of "equal", but another test must be added to check that 
> serialization and deserialization keep the dirty state.



