[
https://issues.apache.org/jira/browse/GORA-401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250632#comment-14250632
]
Alfonso Nishikawa commented on GORA-401:
----------------------------------------
Hi, [~drazzib], your question is closely related, but not exactly the same.
When you wrote it I didn't understand, because I was using an older version,
but after upgrading I understand you now, and I comment on the same thing
below in (1).
The problem I describe arises after GORA-326, applied on August 19th. I will
answer [~renato2099] at the same time :)
Hi, [~renato2099]. When StateManager was deleted and the {{__g__dirty}} field
was introduced inside the schema, Avro serialized it together with the rest of
the fields, so the dirty state traveled with the entity (although it wrongly
lost the dirty state of the maps' key-values). That was, in my opinion, a bad
design. In GORA-326, {{__g__dirty}} was removed from the schema fields and
became an in-memory dirty state that is not serialized by Avro. In my opinion
that is a better design, because the dirty state is not part of the schema
fields (but it still has flaws).
When an entity is sent from the Map phase to the Reduce phase, it is serialized
with the Avro serializer, and losing the dirty state is a bad thing. Let's see
why:
# You load an entity specifying only a few fields (a subset of all fields), as
we know we can do. Fields that are not loaded have a null value (or the default
value for primitive Java types).
# After serializing and deserializing, every field becomes dirty.
# When you write, *all* fields get persisted.
This, simply, was not the behavior when StateManager was in place, nor before
GORA-326. But there are more important implications:
* Since every field will *eventually* be written with a null value, you have to
define your schemas with all fields as "union null". Otherwise you always have
to read the whole entity.
* Nutch breaks horribly: after {{updatedb}} all downloaded content is deleted
because updatedb does not load that field. I don't know why no one noticed it :P
* If you want to update only one field, you *always* have to read all the
fields. Before this change, you could just read the fields of interest, update
one of them and persist.
* If you create a new entity and are interested in only 1 field, you have to
assign a value to all fields or define all of them as nullable.
* etc...
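The three steps and the Nutch-style data loss can be reproduced with a small,
self-contained sketch. It does not use Gora or Avro: {{SketchEntity}},
{{roundTripWithoutDirtyState}} and the field names are hypothetical stand-ins
that only mimic the behavior described above (values survive the round trip,
the dirty state does not, and every field comes back dirty).

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for a persistent entity; NOT the real Gora Persistent API.
class SketchEntity {
    static final String[] FIELDS = {"url", "status", "content"};
    final String[] values = new String[FIELDS.length];
    final BitSet dirty = new BitSet(FIELDS.length);

    // Setting a field through the entity marks it dirty.
    void set(int index, String value) {
        values[index] = value;
        dirty.set(index);
    }
}

class DirtyStateLossSketch {
    // Assumption: models a Map -> Reduce round trip that copies field values
    // only and leaves every field marked dirty on the deserialized copy.
    static SketchEntity roundTripWithoutDirtyState(SketchEntity in) {
        SketchEntity out = new SketchEntity();
        System.arraycopy(in.values, 0, out.values, 0, in.values.length);
        out.dirty.set(0, SketchEntity.FIELDS.length);
        return out;
    }

    // Models a datastore write: each dirty field overwrites its column,
    // including fields whose in-memory value is null.
    static void write(Map<String, String> storedRow, SketchEntity e) {
        for (int i = 0; i < SketchEntity.FIELDS.length; i++) {
            if (e.dirty.get(i)) {
                storedRow.put(SketchEntity.FIELDS[i], e.values[i]);
            }
        }
    }

    static Map<String, String> demo() {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("url", "http://example.org/");
        row.put("status", "fetched");
        row.put("content", "<html>downloaded page</html>");

        // Step 1: load only "url" and "status"; "content" stays null and clean.
        SketchEntity e = new SketchEntity();
        e.values[0] = row.get("url");
        e.values[1] = row.get("status");
        e.set(1, "updated"); // the only field modified in the Map phase

        // Step 2: after the round trip every field is dirty.
        SketchEntity afterShuffle = roundTripWithoutDirtyState(e);

        // Step 3: the write persists all fields, so "content" becomes null.
        write(row, afterShuffle);
        return row;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the stored "content" column ends up null even though the Map
phase never touched it, which is the same shape of loss as the {{updatedb}}
case.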
About the "two mappers reading the same entity in different machines and
modifying entity differently", the answer is no different than before
GORA-326: it depends on the situation, and you can mess things up in the same
way as you can now.
Before GORA-326, the dirty fields were the ones being updated, and that is how
I think it should be now too. (Obviously, if you wanted to delete a field, you
wrote it blank).
I took a deep look at Nutch and described the effect in the description of this
issue, but I think it would be good if you took a look at Nutch yourself.
Anyway, I feel a bit hurt by your assumption that the problem is probably
something else :(
What I suggest:
I find DirtyStateManager the best design approach, but since dirty state
management has been shifted to the fields' types, I think it is OK to
reintroduce the {{PersistentDatumWriter/Reader}}.
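To illustrate the idea of a writer/reader pair that carries the dirty state
next to the field values, here is a minimal sketch. It is not the old
{{PersistentDatumWriter}} and it does not use Avro: it uses plain {{java.io}}
streams, and {{TrackedRecord}} and {{DirtyAwareCodecSketch}} are hypothetical
names introduced only for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.BitSet;

// Hypothetical entity: field values plus an in-memory dirty bit per field.
class TrackedRecord {
    final String[] values;
    final BitSet dirty;

    TrackedRecord(int fieldCount) {
        values = new String[fieldCount];
        dirty = new BitSet(fieldCount);
    }
}

class DirtyAwareCodecSketch {
    // Writes the field count, then the dirty bits, then each nullable value.
    static byte[] write(TrackedRecord r) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(r.values.length);
            byte[] dirtyBytes = r.dirty.toByteArray();
            out.writeInt(dirtyBytes.length);
            out.write(dirtyBytes);
            for (String v : r.values) {
                out.writeBoolean(v != null); // presence flag for unloaded fields
                if (v != null) {
                    out.writeUTF(v);
                }
            }
        }
        return bytes.toByteArray();
    }

    // Reads the same layout back and restores the dirty bits as written.
    static TrackedRecord read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            TrackedRecord r = new TrackedRecord(in.readInt());
            byte[] dirtyBytes = new byte[in.readInt()];
            in.readFully(dirtyBytes);
            r.dirty.or(BitSet.valueOf(dirtyBytes));
            for (int i = 0; i < r.values.length; i++) {
                if (in.readBoolean()) {
                    r.values[i] = in.readUTF();
                }
            }
            return r;
        }
    }

    // Convenience wrapper so callers do not have to handle IOException.
    static TrackedRecord roundTrip(TrackedRecord r) {
        try {
            return read(write(r));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        TrackedRecord r = new TrackedRecord(3);
        r.values[0] = "http://example.org/";
        r.values[1] = "updated";
        r.dirty.set(1);
        TrackedRecord copy = roundTrip(r);
        System.out.println("dirty preserved: " + copy.dirty.equals(r.dirty));
    }
}
```

With a layout like this, the Reduce side sees the same dirty bits the Map side
produced, so a later write only touches the fields that were really modified.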
(1) About the question of [~drazzib]: before {{__g__dirty}} was introduced in
the fields, Maps tracked the key-values that were added and deleted. Now that
incremental information is not taken into account, which forces you to read and
write all the key-values every time you read/write. I find it wrong, since that
information was useful: you could delete some key-values without having to load
the field (all k-v), and I used to do that. But well... there are now too many
changes to roll back, so OK.
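As a rough illustration of what "managing the key-values added and deleted"
means, here is a minimal sketch of a map wrapper that records incremental
changes and applies only those to the stored data. {{ChangeTrackingMapSketch}}
and {{flushTo}} are hypothetical names for this example, not the map classes
Gora used.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical wrapper that remembers which keys were put or removed,
// so a write can apply only the delta instead of the whole map.
class ChangeTrackingMapSketch<K, V> {
    private final Map<K, V> current = new HashMap<>();
    private final Set<K> putKeys = new HashSet<>();
    private final Set<K> removedKeys = new HashSet<>();

    void put(K key, V value) {
        current.put(key, value);
        removedKeys.remove(key);
        putKeys.add(key);
    }

    // A key can be removed without ever having been loaded from the store.
    void remove(K key) {
        current.remove(key);
        putKeys.remove(key);
        removedKeys.add(key);
    }

    // Applies only the recorded changes to the stored map, then resets them.
    void flushTo(Map<K, V> stored) {
        for (K key : putKeys) {
            stored.put(key, current.get(key));
        }
        for (K key : removedKeys) {
            stored.remove(key);
        }
        putKeys.clear();
        removedKeys.clear();
    }

    public static void main(String[] args) {
        Map<String, String> stored = new HashMap<>();
        stored.put("a", "1");
        stored.put("b", "2");

        ChangeTrackingMapSketch<String, String> map = new ChangeTrackingMapSketch<>();
        map.remove("b");   // delete without loading the stored key-values
        map.put("c", "3");
        map.flushTo(stored);
        System.out.println(stored.keySet());
    }
}
```

In the usage above the stored entry "a" is never read or rewritten; only the
removal of "b" and the addition of "c" reach the store.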
If I had to choose between the StateManager and the state managed in the
instance of Maps, I would vote for the StateManager, because each backend could
use a state manager suited to it. But well... maybe that will come some day.
Thanks!
> Serialization and deserialization of Persistent does not hold the entity
> dirty state
> ------------------------------------------------------------------------------------
>
> Key: GORA-401
> URL: https://issues.apache.org/jira/browse/GORA-401
> Project: Apache Gora
> Issue Type: Bug
> Components: gora-core
> Affects Versions: 0.4, 0.5
> Environment: Tested on gora-0.4, but seems logically to hold on
> gora-0.5
> Reporter: Alfonso Nishikawa
> Priority: Critical
> Labels: serialization
> Original Estimate: 35h
> Remaining Estimate: 35h
>
> After removing __g__dirty field in GORA-326, dirty field is not serialized.
> In GORA-321
> {{[PersistentSerializer|https://github.com/apache/gora/blob/master/gora-core/src/main/java/org/apache/gora/mapreduce/PersistentSerializer.java]}}
> went from using
> {{[PersistentDatumWriter|https://github.com/apache/gora/blob/apache-gora-0.3/gora-core/src/main/java/org/apache/gora/avro/PersistentDatumWriter.java](/Reader)}}
> to Avro's {{SpecificDatumWriter}}, delegating the serialization of the dirty
> field to Avro (although it is not really desirable to have that field as a
> main field in the entities).
> The proposal is to reintroduce the {{PersistentDatumWriter/Reader}} which
> will serialize the internal fields of the entities.
> This bug affects, for example, Nutch, which loads only some fields in its
> phases, serializes entities (from Map to Reduce), and on deserialization finds
> all fields "dirty", regardless of which fields were modified in the Map,
> and overwrites all data in the datastore (deleting many things: downloaded
> content, parsed content, etc).
> This effect can be seen in
> {{TestPersistentSerialization#testSerderEmployeeTwoFields}}, when debugging in
> {{TestIOUtils#testSerializeDeserialize}}. Proper breakpoints and inspections
> show that entities are "equal" when their fields are equal. This is fine as a
> definition of "equal", but another test must be added to check that
> serialization and deserialization keep the dirty state.