[ https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15501312#comment-15501312 ]
Kevin Ratnasekera edited comment on GORA-401 at 9/18/16 5:17 PM:
-----------------------------------------------------------------
[~alfonso.nishikawa] I don't have as much background as you do on this issue
and the related issues in Apache Nutch. I had a look at the code, and it
seems that PersistentSerializer and PersistentDeserializer can be modified to
preserve the dirty bytes. Since we register those Serializer/Deserializer
classes for persistent data beans in the Hadoop conf, all
serialization/deserialization is delegated to them.
I have added a simple dirty bytes comparison to
TestIOUtils#testSerializeDeserialize, so that we can verify whether the
proposed fix preserves the dirty bytes after a data bean is serialized and
deserialized. The earlier checks only guarantee equality of fields between
the two persistent data beans.
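To illustrate why a field-equality check alone is not enough, here is a
minimal sketch of the kind of comparison I mean. It is not the actual test
code: EmployeeSketch, DirtyCheckSketch and sameDirtyState are hypothetical
names, and the one-bit-per-field byte array is only an assumed stand-in for
the dirty bytes of a real persistent data bean.

```java
import java.util.Arrays;
import java.util.Objects;

// Hypothetical stand-in for a persistent data bean; not Gora's Persistent API.
final class EmployeeSketch {
    String name;
    long salary;
    // Assumed layout: one bit per field (bit 0 = name, bit 1 = salary).
    byte[] dirtyBytes = new byte[1];

    void setName(String n) { name = n; dirtyBytes[0] |= 1; }
    void setSalary(long s) { salary = s; dirtyBytes[0] |= 2; }

    // Equality looks at field values only, like the earlier checks.
    @Override public boolean equals(Object o) {
        if (!(o instanceof EmployeeSketch)) return false;
        EmployeeSketch e = (EmployeeSketch) o;
        return Objects.equals(name, e.name) && salary == e.salary;
    }

    @Override public int hashCode() { return Objects.hash(name, salary); }
}

final class DirtyCheckSketch {
    // The additional check: compare the dirty bytes of both beans.
    static boolean sameDirtyState(EmployeeSketch a, EmployeeSketch b) {
        return Arrays.equals(a.dirtyBytes, b.dirtyBytes);
    }

    public static void main(String[] args) {
        EmployeeSketch before = new EmployeeSketch();
        before.setName("alice"); // only "name" is dirty

        // Simulates a round trip that rebuilds the bean through its setters
        // and therefore marks every field as dirty.
        EmployeeSketch after = new EmployeeSketch();
        after.setName(before.name);
        after.setSalary(before.salary);

        System.out.println("fields equal: " + before.equals(after));
        System.out.println("dirty state equal: " + sameDirtyState(before, after));
    }
}
```

In this sketch the two beans compare as equal while their dirty bytes differ,
which is the case the additional check is meant to catch.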
I have submitted a pull request for this. Please review it, and if you can
provide some more direction, I can take a more detailed look into the issue.
Regards
Kevin
was (Author: djkevincr):
[~alfonso.nishikawa] I don't have as much background as you do on this issue
and the related issues in Apache Nutch. I had a look at the code, and it
seems that PersistentSerializer and PersistentDeserializer can be modified to
preserve the dirty bytes. Since we register those Serializer/Deserializer
classes for persistent data beans in the Hadoop conf, all
serialization/deserialization is delegated to them.
I have added a simple dirty bytes comparison to
TestIOUtils#testSerializeDeserialize, so that we can verify whether the
proposed fix preserves the dirty bytes after a data bean is serialized and
deserialized. The earlier checks only guarantee equality of fields between
the two persistent data beans.
Please review it, and if you can provide some more direction, I can take a
more detailed look into the issue.
Regards
Kevin
> Serialization and deserialization of Persistent does not hold the entity
> dirty state from Map to Reduce
> -------------------------------------------------------------------------------------------------------
>
> Key: GORA-401
> URL: https://issues.apache.org/jira/browse/GORA-401
> Project: Apache Gora
> Issue Type: Bug
> Components: gora-core
> Affects Versions: 0.4, 0.5
> Environment: Tested on gora-0.4, but seems logically to hold on
> gora-0.5. HBase backend.
> Reporter: Alfonso Nishikawa
> Assignee: Alfonso Nishikawa
> Priority: Critical
> Labels: serialization
> Fix For: 0.8
>
> Attachments: GORA-401-tests.patch, GORA-401v1.patch,
> GORA-401v2.patch, GORA-401v3.patch, GORA-401v4.patch, GORA-401v5.patch
>
> Original Estimate: 35h
> Time Spent: 21h
> Remaining Estimate: 14h
>
> After the removal of the __g__dirty field in GORA-326, the dirty field is no
> longer serialized.
> In GORA-321
> {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
> went from using
> {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
> to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty
> field to Avro (although it is not really desirable to have that field as a
> main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which
> will serialize the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its
> phases and serializes entities (from Map to Reduce). On deserialization, all
> fields are found to be "dirty", regardless of which fields were modified in
> the Map, so all data in the datastore is overwritten (deleting many things:
> downloaded content, parsed content, etc).
> This effect can be seen in
> {{TestPersistentSerialization#testSerderEmployeeTwoFields}}, when debugging in
> {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspection
> show that entities are "equal" when their fields are equal. This is fine as a
> definition of "equal", but another test must be added to check that
> serialization and deserialization keep the dirty state.
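The idea in the description above (serialize the entity's internal dirty
state together with its fields) can be sketched roughly as follows. This is
only an illustration under stated assumptions, not the patch attached to the
issue and not Gora's or Avro's API: PageSketch and DirtyPreservingSerdeSketch
are hypothetical names, the dirty state is assumed to fit in a single byte,
and plain java.io data streams stand in for the Avro datum writer/reader.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical entity; not Gora's Persistent interface.
final class PageSketch {
    String url = "";
    String content = "";
    byte dirty; // assumed layout: bit 0 = url, bit 1 = content

    void setUrl(String u) { url = u; dirty |= 1; }
    void setContent(String c) { content = c; dirty |= 2; }
    boolean isDirty(int fieldIndex) { return (dirty & (1 << fieldIndex)) != 0; }
}

final class DirtyPreservingSerdeSketch {
    // Writes the dirty byte first, then the field values.
    static byte[] serialize(PageSketch p) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeByte(p.dirty);
            out.writeUTF(p.url);
            out.writeUTF(p.content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    // Reads the dirty byte, fills the fields, then restores the dirty byte
    // so that setters called during deserialization do not leak into it.
    static PageSketch deserialize(byte[] bytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            byte dirty = in.readByte();
            PageSketch p = new PageSketch();
            p.setUrl(in.readUTF());
            p.setContent(in.readUTF());
            p.dirty = dirty;
            return p;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        PageSketch inMap = new PageSketch();
        inMap.setUrl("http://example.org/"); // only "url" modified in the Map

        PageSketch inReduce = deserialize(serialize(inMap));
        System.out.println("url dirty: " + inReduce.isDirty(0));
        System.out.println("content dirty: " + inReduce.isDirty(1));
    }
}
```

Without the final restore step in deserialize, both fields would come out as
dirty in this sketch, which corresponds to the overwrite behaviour described
for Nutch.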
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)