hi Liya,

I left a couple of comments in the document. You might look at what we
have developed in C++ and JavaSript which is more mature and widely
used in those languages than what currently exists in Java.

In particular, I strongly encourage you to avoid creating a coupling
between the Schema (which indicates the "index type" and the
"dictionary type") and the Data (which contains the "dictionary
indices" and the "dictionary values"). If you do this, then it is
nearly impossible to account for dictionary changes in a stream of
data, which is already supported in the IPC protocol, see

https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#dictionary-batches

The protocol does not permit dictionaries to change order altogether,
but I think it should. I just opened
https://issues.apache.org/jira/browse/ARROW-5767

We made this mistake early in the C++ library and found it difficult
to have to reconciled dictionaries produced by different threads of
execution. See ARROW-3144 and the patch I wrote in this
release cycle which addressed this design flaw.

Thanks
Wes

On Wed, Jun 12, 2019 at 5:33 AM Fan Liya <[email protected]> wrote:
>
> @Micah Kornfield <[email protected]> Thanks a lot for your comments.
>
> In the doc, we identify 3 problems for the current dictionary encoding use
> case (there can be more, so please give your valuable suggestions):
>
> 1. there should be a convenient way to provide access to both
> encoded/decoded data.
> 2. the constructor for the schema with dictionary is misleading.
> 3. we should provide a way to do the encoding/decoding during
> serialization/deserialization, so the encoding/decoding can be transparent
> to the user.
>
> The current PR provides a solution for problem 2, and it is a relatively
> small change, so we think we can merge this PR first.
>
> Solutions for problem 1 and 3 should be chosen carefully so as not to
> affect existing APIs. So more discussions/designs for problems 1 and 3 are
> desired.
> Do you think it reasonable?
>
> Best,
> Liya Fan
>
>
>
> On Wed, Jun 12, 2019 at 4:01 PM Micah Kornfield <[email protected]>
> wrote:
>
> > Hi Liya Fan,
> > Thanks you for doing this.  I need to take a closer look at the PR in
> > question and the dictionary encoding code but this seems like it is on the
> > right track.
> >
> > Could other java contributors with more familiarity in the space look over
> > the document to make sure it makes sense to them?
> >
> > Thanks,
> > Micah
> >
> > On Mon, Jun 10, 2019 at 2:23 AM Fan Liya <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > This is concerning issue ARROW-3396.
> > >
> > > I have summarized the problem (please see if my understanding is
> > correct),
> > > and proposed some solutions to it. Please give your valuable feedback.
> > > For details, please see:
> > >
> > >
> > >
> > https://docs.google.com/document/d/1Y2E6RbZkUj3SwuEJrlEjaeIPmCA1SIsi9wmbJmVlB2I/edit?usp=sharing
> > >
> > > Thank you in advance.
> > >
> > > Best,
> > > Liya Fan
> > >
> >

Reply via email to