Okay, sounds good. On Fri, Oct 13, 2017 at 2:50 PM, Wes McKinney <wesmck...@gmail.com> wrote:
> It is fine to have not-completely-working states in the refactor > branch. I recommend do whatever is the most expedient thing to help > with making progress. > > - Wes > > On Fri, Oct 13, 2017 at 5:42 PM, Siddharth Teotia <siddha...@dremio.com> > wrote: > > Li, > > > > I think there is some confusion. Are you suggesting merging into "java > > vector refactor" branch or the master? Is it fine to merge stuff on the > > former branch even though few things are broken (around 10 tests) ? If > this > > is allowed, I can do some cleanup (some documentation, some TODOs > suggested > > by you and Brian) and we can merge the current patch by EOD or over the > > weekend. > > > > Is this okay? Since we are going to iterate over this branch and not > going > > to push anything to master until new code is stable, we are probably > good. > > > > Thanks, > > Siddhath > > > > > > > > On Fri, Oct 13, 2017 at 12:17 PM, Li Jin <ice.xell...@gmail.com> wrote: > > > >> Siddharth, > >> > >> Regarding rename: > >> Yes this can be done later. > >> > >> Tests: > >> I agree having code like https://github.com/apache/ > >> arrow/pull/1164/files#diff-0876c9a0005d1dbaea321ea8d39d79ae is hard to > >> maintain even temporarily. I am not sure what's the best way to resolve > >> test failure wrt removing of the accessor/mutator from the vectors. > Maybe > >> we can have change the template the create non-accessor/mutator > >> getter/setters and also remove acessor/mutator in the test for it to > pass? > >> What do you think is the easiest? > >> > >> Reader/Writer: > >> Yes we can address this later. > >> > >> Apologies if I seem to add more work for merging https://github.com/ > >> apache/arrow/pull/1164, that's not my intention, I think the PR looks > >> good - > >> just want to bring up some major design decisions so people can comment > and > >> discuss. > >> > >> Li > >> > >> > >> > >> > >> On Fri, Oct 13, 2017 at 2:37 PM, Siddharth Teotia <siddha...@dremio.com > > > >> wrote: > >> > >> > I am not quite sure of the need to rename the vectors. Why do we need > to > >> > rename? This would first require us to remove all the vectors > generated > >> by > >> > FixedValueVectors.java as they are non-nullable scalar vectors. > Removing > >> > non-nullable vectors is one of the goals, but it can be done once the > new > >> > infrastructure is properly setup? > >> > > >> > In order to merge the existing patch, I first need to address some > >> (10-15) > >> > failures -- few of them are correctness issues w.r.t > >> TestVectorUnloadLoad, > >> > TestArrowFile and rest all are related to getMutator(), getAccessor() > >> > throwing UnsupportedOperationException. This is why I was saying > earlier > >> > that I will end up doing a lot of rework by writing redundant code > where > >> > (if vector instanceof NullableInt or vector instanceof > NullableVarChar) > >> we > >> > don't use the mutator/accessor and for other vectors we use it for the > >> > current patch. These if conditions are getting complicated with ugly > type > >> > casting in some parts of the code -- > >> > https://github.com/apache/arrow/pull/1164/files#diff- > >> > 0876c9a0005d1dbaea321ea8d39d79ae > >> > > >> > So I thought we can probably implement other vectors (remaining > scalars, > >> > map and list) where no vector has mutator/accessor and then for every > >> > ValueVector, we can remove all calls to getMutator(), getAccessor() as > >> > opposed to doing them selectively --- > >> > https://github.com/apache/arrow/pull/1164/files#diff- > >> > e9273a7b3b35ff7f40f101dc2cf95242 > >> > > >> > I will try to address these failures by EOD and see if this patch can > be > >> > merged first. > >> > > >> > Regarding readers and writers, can we address them subsequently? > >> > > >> > On Fri, Oct 13, 2017 at 11:03 AM, Li Jin <ice.xell...@gmail.com> > wrote: > >> > > >> > > Siddharth, > >> > > > >> > > Thanks for the update. I think it's fine to move forward with more > >> > vectors, > >> > > but in the mean time, I think we should also prioritize to merge > >> > > https://github.com/apache/arrow/pull/1164, here are a few comments > >> needs > >> > > to > >> > > be addressed. > >> > > > >> > > (1) Backward-compatibility: > >> > > I think there is no way to maintain backward compability as the new > >> > vector > >> > > classes will be renamed, but want to confirm we are OK with this > >> > decision. > >> > > We also think the disruption on the Spark side are OK as Spark's use > >> case > >> > > is simple and Bryan and I can take care of the code change. > >> > > > >> > > (2) Reader/writer classes: > >> > > How does the reader/writer classes interact with the new and legacy > >> > vector > >> > > classes: > >> > > > >> > > Discussion: https://github.com/apache/arrow/pull/1164#discussion_ > >> > > r144074264 > >> > > > >> > > My thoughts are: > >> > > (1) ArrowReader classes should only return new vector classes > >> > > (2) ArrowWriter classes should only work with new vector classes > >> > > (3) To read/write legacy vectors, we can use adapters to turn legacy > >> > > vectors to new vectors (zero-copy, as the underlying buffers should > be > >> > > transferred directly) > >> > > > >> > > Jacques also has a few comments, I don't know if they have been > >> > addressed. > >> > > > >> > > For other comments, I think we can add TODO and do it later. I > think we > >> > can > >> > > merge this PR if we address (1) (2) above. > >> > > > >> > > Comments? > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > > >> > > On Fri, Oct 13, 2017 at 12:36 PM, Siddharth Teotia < > >> siddha...@dremio.com > >> > > > >> > > wrote: > >> > > > >> > > > The patch that I have put up https://github.com/apache/ > >> arrow/pull/1198 > >> > > > seems to be in a reasonable state. We are now working off a > different > >> > > > branch "java vector refactor". > >> > > > > >> > > > Now that we have the basic structure, in order to make quick > forward > >> > > > progress, I would like to go ahead and do for other types (FLOAT, > >> > BIGINT > >> > > > etc), list, map and create their legacy > >> > > > counter parts -- doing them in subsequent patches is requiring me > to > >> > > write > >> > > > some duplicate code and redundant if conditions in code that > expects > >> > all > >> > > > the vectors to have mutator/accessor. > >> > > > > >> > > > Is that fine? Just wanted to check with people and ensure there > >> aren't > >> > > any > >> > > > major concerns. > >> > > > > >> > > > The feedback on the PR (original one for master > >> > > > https://github.com/apache/arrow/pull/1164) has been really good > -- > >> > some > >> > > of > >> > > > the comments are yet to be addressed and we jointly decided to > >> address > >> > > few > >> > > > things (like Minor Type etc) after the refactoring has been done. > >> > > > > >> > > > On the testing front, as far as the correctness is concerned, I > have > >> > two > >> > > > failures in TestArrowFile and TestValueVector. I have added some > more > >> > > tests > >> > > > too. > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > > >> > > > On Thu, Oct 12, 2017 at 2:18 PM, Siddharth Teotia < > >> > siddha...@dremio.com> > >> > > > wrote: > >> > > > > >> > > > > Yes, that is the intention. Good that we all are on the same > page. > >> I > >> > > will > >> > > > > move the PR https://github.com/apache/arrow/pull/1164 to new > >> branch. > >> > > > > > >> > > > > On Thu, Oct 12, 2017 at 11:20 AM, Li Jin <ice.xell...@gmail.com > > > >> > > wrote: > >> > > > > > >> > > > >> To make clear, I think it's fine to have Legacy Vectors in 0.8 > as > >> a > >> > > > >> deprecated API. > >> > > > >> > >> > > > >> On Thu, Oct 12, 2017 at 2:19 PM, Li Jin <ice.xell...@gmail.com > > > >> > > wrote: > >> > > > >> > >> > > > >> > Siddharth, > >> > > > >> > > >> > > > >> > For working off a branch, Wes has created > >> > > https://github.com/apache/ > >> > > > >> > arrow/tree/java-vector-refactor that we can submit PR to. > >> > > > >> > > >> > > > >> > For Legacy vectors, I think it's fine because it's really > just a > >> > > > >> migration > >> > > > >> > path to help Dremio to migrate to the new vectors. I don't > think > >> > > other > >> > > > >> > users, i.e., Spark will use the Legacy vector class. Bryan > and I > >> > > will > >> > > > >> just > >> > > > >> > migrate Spark to new vectors directly because Spark's use of > >> Arrow > >> > > is > >> > > > >> very > >> > > > >> > simple. > >> > > > >> > > >> > > > >> > > >> > > > >> > > >> > > > >> > On Thu, Oct 12, 2017 at 2:08 PM, Siddharth Teotia < > >> > > > siddha...@dremio.com > >> > > > >> > > >> > > > >> > wrote: > >> > > > >> > > >> > > > >> >> Thanks Bryan and Li. > >> > > > >> >> > >> > > > >> >> Yes, the goal is to get this (and the subsequent patches) > >> merged > >> > to > >> > > > the > >> > > > >> >> new > >> > > > >> >> branch. Once it is stabilized from different aspects, we can > >> move > >> > > to > >> > > > >> >> master. I am not sure of the exact mechanics when we work > off a > >> > > > >> different > >> > > > >> >> project branch and not master. > >> > > > >> >> > >> > > > >> >> Does that sound good? > >> > > > >> >> > >> > > > >> >> Regarding compatibility, are we suggesting that let's not > >> create > >> > > > Legacy > >> > > > >> >> Nullable vectors at all? The initial thoughts were to > generate > >> > > Legacy > >> > > > >> >> vectors from NullableValueVectors template and these vectors > >> are > >> > > > >> >> mutator/accessor based (in today's world). Internally each > >> > > operation > >> > > > >> will > >> > > > >> >> be delegated to new vectors (non code generated). > >> > > > >> >> > >> > > > >> >> On Thu, Oct 12, 2017 at 10:58 AM, Bryan Cutler < > >> > cutl...@gmail.com> > >> > > > >> wrote: > >> > > > >> >> > >> > > > >> >> > Thanks for the update Siddharth. From the Spark side of > >> this, > >> > I > >> > > > >> >> definitely > >> > > > >> >> > want to try to upgrade to the latest Arrow before the > Spark > >> 2.3 > >> > > > >> release > >> > > > >> >> but > >> > > > >> >> > if it the refactor is too disruptive then others might get > >> > > > squeamish > >> > > > >> >> about > >> > > > >> >> > upgrading. On the other hand, I don't think we should > hold > >> > back > >> > > on > >> > > > >> >> > refactoring for compatibility sake and the way it's > looking > >> now > >> > > > >> trying > >> > > > >> >> to > >> > > > >> >> > be backwards-compatible will be too much of a pain. I > will > >> try > >> > > to > >> > > > >> >> figure > >> > > > >> >> > out the timeline for Spark 2.3 and what the feeling is for > >> > > > upgrading > >> > > > >> >> > Arrow. Can we hold off on merging this to master for now > and > >> > > just > >> > > > >> work > >> > > > >> >> out > >> > > > >> >> > of the separate branch until we can get a better feeling > for > >> > the > >> > > > >> impact? > >> > > > >> >> > > >> > > > >> >> > Bryan > >> > > > >> >> > > >> > > > >> >> > On Wed, Oct 11, 2017 at 8:19 AM, Li Jin < > >> ice.xell...@gmail.com > >> > > > >> > > > >> wrote: > >> > > > >> >> > > >> > > > >> >> > > Hi Siddharth, > >> > > > >> >> > > > >> > > > >> >> > > Thanks for the update. This looks good. > >> > > > >> >> > > > >> > > > >> >> > > A few thoughts: > >> > > > >> >> > > > >> > > > >> >> > > *Compatibility:* > >> > > > >> >> > > It sounds like we will introduce some back-compatibility > >> with > >> > > the > >> > > > >> new > >> > > > >> >> > > Vector class. At this point I think our main Java users > >> > should > >> > > be > >> > > > >> >> Spark > >> > > > >> >> > and > >> > > > >> >> > > Dremio, is this right? > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > - For Spark: > >> > > > >> >> > > > >> > > > >> >> > > It seems fine since Spark uses just the basic > functionality > >> > of > >> > > > >> Vector > >> > > > >> >> > > classes and the existing code should work with the new > >> Vector > >> > > > >> classes, > >> > > > >> >> > > maybe even without any code change on the Spark side. > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > - For Dremio: > >> > > > >> >> > > > >> > > > >> >> > > Sounds like you are already taking care of this by > >> > introducing > >> > > > the > >> > > > >> >> > > LegacyVector classes. > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > *Testing:* > >> > > > >> >> > > > >> > > > >> >> > > - Spark Integration Tests: > >> > > > >> >> > > > >> > > > >> >> > > Bryan and I can help with integration test with Spark. I > >> > think > >> > > > the > >> > > > >> >> target > >> > > > >> >> > > timeline for Spark 2.3 release is some time in mid Nov > >> (Bryan > >> > > > >> please > >> > > > >> >> > > correct me if I am wrong). > >> > > > >> >> > > > >> > > > >> >> > > I will take a look at the PR today. > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > > >> > > > >> >> > > On Tue, Oct 10, 2017 at 4:29 PM, Siddharth Teotia < > >> > > > >> >> siddha...@dremio.com> > >> > > > >> >> > > wrote: > >> > > > >> >> > > > >> > > > >> >> > > > Hi All, > >> > > > >> >> > > > > >> > > > >> >> > > > I wanted to update everyone on state of this > >> mini-project: > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - Requirements document and initial design proposal > >> were > >> > > > sent > >> > > > >> >> out to > >> > > > >> >> > > the > >> > > > >> >> > > > community for review and we have received some good > >> > > > feedback. > >> > > > >> All > >> > > > >> >> > > > required > >> > > > >> >> > > > docs are attached with corresponding JIRAs. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - The initial prototype is in a reasonable state > >> > > > >> (code-complete). > >> > > > >> >> > You > >> > > > >> >> > > > can see the PR here - > https://github.com/apache/arro > >> > > > >> w/pull/1164 > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - The prototype has code changes for the new > >> hierarchy, > >> > > > >> abstract > >> > > > >> >> > > > interfaces for fixed width and variable width > vectors > >> > and > >> > > > >> >> concrete > >> > > > >> >> > > > implementation of NullableIntVector and > >> > > > NullableVarCharVector. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > Plan for testing and integrating into existing > >> > > infrastructure: > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - My initial thoughts are that this particular > patch > >> > will > >> > > > >> >> require a > >> > > > >> >> > > lot > >> > > > >> >> > > > of testing, reviews etc since the foundation of > rest > >> of > >> > > the > >> > > > >> >> > > > implementation > >> > > > >> >> > > > more or less depends on how the APIs are flushed > out > >> > here. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - So the goal is to get this properly tested and > >> merged > >> > > into > >> > > > >> >> master > >> > > > >> >> > > > first. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - The idea is to slowly deprecate and remove the > >> > existing > >> > > > >> >> vectors in > >> > > > >> >> > > > stages. In this patch itself, we change the > existing > >> > > > >> >> > > > NullableValueVectors.java template to generate > >> > > > >> >> > LegacyNullableIntVector > >> > > > >> >> > > > and > >> > > > >> >> > > > LegacyNullableVarCharVector. Each operation on > these > >> > > vectors > >> > > > >> will > >> > > > >> >> > > > delegate > >> > > > >> >> > > > to the corresponding NullableIntVector and > >> > > > >> NullableVarCharVector > >> > > > >> >> > that > >> > > > >> >> > > > are > >> > > > >> >> > > > newly implemented. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - This achieves two goals w.r.t testing: > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - Firstly, our existing JAVA unit tests will > >> > automatically > >> > > > >> >> exercise > >> > > > >> >> > > the > >> > > > >> >> > > > newly written code and its APIs (API names have > not > >> > > > >> changed) > >> > > > >> >> for > >> > > > >> >> > > > NullableInt and NullableVarChar vectors. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - Secondly, let's say we rebase Dremio on top of > Arrow > >> > > > master > >> > > > >> and > >> > > > >> >> > > > replace all references to NullableIntVector and > >> > > > >> >> > > > NullableVarCharVector with > >> > > > >> >> > > > their Legacy counterparts, things should still > >> work. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - After this patch gets merged, we can do the > >> following > >> > > work > >> > > > >> in > >> > > > >> >> > > multiple > >> > > > >> >> > > > patches: > >> > > > >> >> > > > - Write concrete implementations for rest of the > >> > > nullable > >> > > > >> >> types > >> > > > >> >> > -- > >> > > > >> >> > > > FLOAT4, FLOAT8, BIGINT, VARBINARY etc > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - Write additional tests (definitely needed but the > >> > first > >> > > > goal > >> > > > >> >> is to > >> > > > >> >> > > > make sure existing tests are not broken). > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - Ensure NullableValueVectors template generates > >> Legacy > >> > > > >> vectors > >> > > > >> >> and > >> > > > >> >> > > each > >> > > > >> >> > > > operation is merely a delegation to the API in > new > >> > > > >> >> > implementation. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - In the next Arrow release, remove all Legacy > vectors > >> > and > >> > > > >> >> > > > NullableValueVectors template since we will have > >> the > >> > > > >> >> > implementation > >> > > > >> >> > > > for > >> > > > >> >> > > > each type that passes existing tests. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > - I am currently inspecting the newly written code > and > >> > > > making > >> > > > >> >> > changes > >> > > > >> >> > > to > >> > > > >> >> > > > the template to generate Legacy vector types for > >> > > Nullable > >> > > > >> Int > >> > > > >> >> > > > and Nullable > >> > > > >> >> > > > VarChar and delegating the operations. The > changes > >> > > should > >> > > > >> be > >> > > > >> >> > > > available in > >> > > > >> >> > > > the PR in a couple of hours. > >> > > > >> >> > > > > >> > > > >> >> > > > > >> > > > >> >> > > > I am wondering if there are any other ideas around > >> testing, > >> > > > >> merging > >> > > > >> >> > etc. > >> > > > >> >> > > > Please feel free to reply here or comment on the PR. > >> > > > >> >> > > > > >> > > > >> >> > > > I would appreciate if people can take time to review > the > >> > code > >> > > > in > >> > > > >> PR > >> > > > >> >> -- > >> > > > >> >> > > > especially the abstract classes BaseNullableFixedWidth > >> and > >> > > > >> >> > > > BaseNullableVariableWidth. Writing concrete > >> implementations > >> > > for > >> > > > >> >> other > >> > > > >> >> > > types > >> > > > >> >> > > > will be much less hassle if these abstract classes > have > >> > > proper > >> > > > >> code. > >> > > > >> >> > > > > >> > > > >> >> > > > Thanks, > >> > > > >> >> > > > Siddharth > >> > > > >> >> > > > > >> > > > >> >> > > > >> > > > >> >> > > >> > > > >> >> > >> > > > >> > > >> > > > >> > > >> > > > >> > >> > > > > > >> > > > > > >> > > > > >> > > > >> > > >> >