[jira] [Commented] (ARROW-5270) [C++] Reenable Valgrind on Travis-CI
[ https://issues.apache.org/jira/browse/ARROW-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839104#comment-16839104 ] Pindikura Ravindra commented on ARROW-5270: --- [https://travis-ci.org/apache/arrow/jobs/531878628] > [C++] Reenable Valgrind on Travis-CI > > > Key: ARROW-5270 > URL: https://issues.apache.org/jira/browse/ARROW-5270 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Running Valgrind on Travis-CI was disabled in ARROW-4611 (apparently because > of issues within the re2 library). > We should reenable it at some point in order to exercise the reliability of > our C++ code. > (and/or have a build with another piece of instrumentation enabled such as > ASAN) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5270) [C++] Reenable Valgrind on Travis-CI
[ https://issues.apache.org/jira/browse/ARROW-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839103#comment-16839103 ] Pindikura Ravindra commented on ARROW-5270: --- There are two issues:

1. instructions not recognized by valgrind
{code}
==20276== Your program just tried to execute an instruction that Valgrind
==20276== did not recognise. There are two possible reasons for this.
==20276== 1. Your program has a bug and erroneously jumped to a non-code
==20276==    location. If you are running Memcheck and you just saw a
==20276==    warning about a bad jump, it's probably your program's fault.
{code}
2. the re2 issues

I think these are already covered by the suppressions listed in the valgrind.supp, but they aren't being recognized due to missing symbols in the stack. When I ran this on my xenial setup without any conda setup, the stacks showed up correctly and got suppressed. So, I suspect this is an issue with the conda binaries. > [C++] Reenable Valgrind on Travis-CI > > > Key: ARROW-5270 > URL: https://issues.apache.org/jira/browse/ARROW-5270 > Project: Apache Arrow > Issue Type: Bug > Components: C++, Continuous Integration >Reporter: Antoine Pitrou >Priority: Critical > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Running Valgrind on Travis-CI was disabled in ARROW-4611 (apparently because > of issues within the re2 library). > We should reenable it at some point in order to exercise the reliability of > our C++ code. > (and/or have a build with another piece of instrumentation enabled such as > ASAN) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
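The suppression-matching problem described above (entries in valgrind.supp not matching because the conda binaries lack symbols) can typically be worked around by matching frames by object path rather than by function name. A hedged sketch of what such an entry looks like; the entry name and object pattern here are illustrative, not taken from Arrow's actual suppression file:

```
# Illustrative suppression entry; the name and object pattern are made up.
# When symbols are stripped, fun: frames won't match, but obj: frames still can.
{
   re2-uninitialised-cond
   Memcheck:Cond
   obj:*/libre2.so*
}
```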
[jira] [Commented] (ARROW-5272) [C++] [Gandiva] JIT code executed over uninitialized values
[ https://issues.apache.org/jira/browse/ARROW-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839090#comment-16839090 ] Pindikura Ravindra commented on ARROW-5272: --- [~pitrou] I tried this on my xenial setup (on GCE) with the same valgrind settings, and wasn't able to reproduce this. The travis build also didn't show failures in the decimal test [https://travis-ci.org/apache/arrow/jobs/531878628] Were you using some additional valgrind flags ? > [C++] [Gandiva] JIT code executed over uninitialized values > --- > > Key: ARROW-5272 > URL: https://issues.apache.org/jira/browse/ARROW-5272 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Antoine Pitrou >Assignee: Pindikura Ravindra >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > When running Gandiva tests with Valgrind, I get the following errors: > {code} > [==] Running 4 tests from 1 test case. > [--] Global test environment set-up. > [--] 4 tests from TestDecimal > [ RUN ] TestDecimal.TestSimple > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x41110D5: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x41110E8: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x44B: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x47B: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > [ OK ] TestDecimal.TestSimple (16625 ms) > [ RUN ] TestDecimal.TestLiteral > [ OK ] TestDecimal.TestLiteral (3480 ms) > [ RUN ] TestDecimal.TestIfElse > [ OK ] TestDecimal.TestIfElse (2408 ms) > [ RUN ] TestDecimal.TestCompare > [ OK ] TestDecimal.TestCompare (5303 ms) > {code} > I think this is legitimate. 
Gandiva runs computations over all values, even > when the bitmap indicates a null value. But decimal computations are complex > and involve conditional jumps, hence the error ("Conditional jump or move > depends on uninitialised value(s)"). > [~pravindra] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector
[ https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839050#comment-16839050 ] Micah Kornfield commented on ARROW-5224: [~tianchen92] my main concern with this change is that it shouldn't be a one-off for Java. If there is utility in these kinds of on-the-wire encodings, we should come up with a supportable way to make them work across language implementations. I think this is important to discuss on the mailing list directly (many people filter out JIRA/pull requests). Real performance numbers/benchmarks would be helpful in making the case to support this. I'm also curious whether you compared this against doing blackbox compression of the entire vector with something like snappy (the link I provided above), to see if there is still a benefit from the encoding after compression is applied. If we are going to make encodings supportable, we should either extend Schema.fbs or use the custom metadata that is already built into the schema (https://github.com/apache/arrow/blob/master/format/Schema.fbs#L265) so encodings can be communicated across clients. Again, since the convention/design needs to be agreed upon, discussing on the mailing list is important. I think a utility class to convert between BigIntVector and an encoded VarBinaryVector could also be a potentially valuable contribution, but for this use-case I think you lose a lot of the value of encoding (you have a 4-byte overhead to keep track of the offsets per encoded entry). > [Java] Add APIs for supporting directly serialize/deserialize ValueVector > - > > Key: ARROW-5224 > URL: https://issues.apache.org/jira/browse/ARROW-5224 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > There is no API to directly serialize/deserialize ValueVector. 
The only way > to implement this is to put a single FieldVector in VectorSchemaRoot and > convert it to ArrowRecordBatch, and the deserialize process is as well. > Provide a utility class to implement this may be better, I know all > serializations should follow IPC format so that data can be shared between > different Arrow implementations. But for users who only use Java API and want > to do some further optimization, this seem to be no problem and we could > provide them a more option. > This may take some benefits for Java user who only use ValueVector rather > than IPC series classes such as ArrowReordBatch: > * We could do some shuffle optimization such as compression and some > encoding algorithm for numerical type which could greatly improve performance. > * Do serialize/deserialize with the actual buffer size within vector since > the buffer size is power of 2 which is actually bigger than it really need. > * Reduce data conversion(VectorSchemaRoot, ArrowRecordBatch etc) to make it > user-friendly. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector
[ https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839020#comment-16839020 ] Ji Liu commented on ARROW-5224: --- [~emkornfi...@gmail.com] [~bryanc] Thanks for your comments. Sure, we have tested the performance with encoding Arrow in our application, and it shows this will significantly reduce shuffle data with equal or even less E2E time (for Int and BigInt types). I agree with [~bryanc], we could simply provide a utility class to encode BigIntVector into a VarBinaryVector (the only thing I'm worried about is whether multiple transformations will result in significant performance overhead). In this way, we won't break the existing APIs & protocol. I would like to work in this way and test the performance as well. If this works fine, we can further extend it to other languages. What do you think? > [Java] Add APIs for supporting directly serialize/deserialize ValueVector > - > > Key: ARROW-5224 > URL: https://issues.apache.org/jira/browse/ARROW-5224 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > There is no API to directly serialize/deserialize ValueVector. The only way > to implement this is to put a single FieldVector in VectorSchemaRoot and > convert it to ArrowRecordBatch, and the deserialize process is as well. > Provide a utility class to implement this may be better, I know all > serializations should follow IPC format so that data can be shared between > different Arrow implementations. But for users who only use Java API and want > to do some further optimization, this seem to be no problem and we could > provide them a more option. 
> This may take some benefits for Java user who only use ValueVector rather > than IPC series classes such as ArrowReordBatch: > * We could do some shuffle optimization such as compression and some > encoding algorithm for numerical type which could greatly improve performance. > * Do serialize/deserialize with the actual buffer size within vector since > the buffer size is power of 2 which is actually bigger than it really need. > * Reduce data conversion(VectorSchemaRoot, ArrowRecordBatch etc) to make it > user-friendly. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5102) [C++] Reduce header dependencies
[ https://issues.apache.org/jira/browse/ARROW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838962#comment-16838962 ] Wes McKinney commented on ARROW-5102: - I would be in favor of adding a {{StatusBuilder}} API > [C++] Reduce header dependencies > > > Key: ARROW-5102 > URL: https://issues.apache.org/jira/browse/ARROW-5102 > Project: Apache Arrow > Issue Type: Wish > Components: C++ >Affects Versions: 0.13.0 >Reporter: Antoine Pitrou >Priority: Major > Fix For: 0.14.0 > > > To tame C++ compile times, we should try to reduce the number of heavy > dependencies in our .h files. > Two possible avenues come to mind: > * avoid including `unordered_map` and friends > * avoid including C++ stream libraries (such as `iostream`, `ios`, > `sstream`...) > Unfortunately we're currently including `sstream` in `status.h` for some > template APIs. We may move those to a separate include file (e.g. > `status-builder.h`). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5314) [Go] Incorrect Printing for String Arrays with Offsets
[ https://issues.apache.org/jira/browse/ARROW-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5314: -- Labels: pull-request-available (was: ) > [Go] Incorrect Printing for String Arrays with Offsets > --- > > Key: ARROW-5314 > URL: https://issues.apache.org/jira/browse/ARROW-5314 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: James Walker >Priority: Trivial > Labels: pull-request-available > > If an additional string field is added to the Table Example > ([https://github.com/apache/arrow/blob/master/go/arrow/example_test.go#L495-L546)] > the Table Reader outputs unexpected results. > Modified Table example: > {code:java} > pool := memory.NewGoAllocator() > schema := arrow.NewSchema( > []arrow.Field{ > arrow.Field{Name: "f1-i32", Type: arrow.PrimitiveTypes.Int32}, > arrow.Field{Name: "f2-f64", Type: arrow.PrimitiveTypes.Float64}, > arrow.Field{Name: "string", Type: arrow.BinaryTypes.String}, > }, > nil, > ) > b := array.NewRecordBuilder(pool, schema) > defer b.Release() > b.Field(0).(*array.Int32Builder).AppendValues([]int32{1, 2, 3, 4, 5, 6}, nil) > b.Field(0).(*array.Int32Builder).AppendValues([]int32{7, 8, 9, 10}, > []bool{true, true, false, true}) > b.Field(1).(*array.Float64Builder).AppendValues([]float64{1, 2, 3, 4, 5, 6, > 7, 8, 9, 10}, nil) > b.Field(2).(*array.StringBuilder).AppendValues([]string{ > "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", > }, nil) > rec1 := b.NewRecord() > defer rec1.Release() > b.Field(0).(*array.Int32Builder).AppendValues([]int32{11, 12, 13, 14, 15, 16, > 17, 18, 19, 20}, nil) > b.Field(1).(*array.Float64Builder).AppendValues([]float64{11, 12, 13, 14, 15, > 16, 17, 18, 19, 20}, nil) > b.Field(2).(*array.StringBuilder).AppendValues([]string{ > "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", > "seventeen", "eighteen", "nineteen", "twenty", > }, nil) > rec2 := b.NewRecord() > defer rec2.Release() > tbl := 
array.NewTableFromRecords(schema, []array.Record{rec1, rec2}) > defer tbl.Release() > tr := array.NewTableReader(tbl, 2) > defer tr.Release() > n := 0 > for tr.Next() { > rec := tr.Record() > for i, col := range rec.Columns() { > fmt.Printf("rec[%d][%q]: %v\n", n, rec.ColumnName(i), col) > } > n++ > } > {code} > > output: > {code:java} > rec[0]["f1-i32"]: [1 2] > rec[0]["f2-f64"]: [1 2] > rec[0]["string"]: ["one" "two"] > rec[1]["f1-i32"]: [3 4] > rec[1]["f2-f64"]: [3 4] > rec[1]["string"]: ["one" "two"] > rec[2]["f1-i32"]: [5 6] > rec[2]["f2-f64"]: [5 6] > rec[2]["string"]: ["one" "two"] > rec[3]["f1-i32"]: [7 8] > rec[3]["f2-f64"]: [7 8] > rec[3]["string"]: ["one" "two"] > rec[4]["f1-i32"]: [(null) 10] > rec[4]["f2-f64"]: [9 10] > rec[4]["string"]: ["one" "two"] > rec[5]["f1-i32"]: [11 12] > rec[5]["f2-f64"]: [11 12] > rec[5]["string"]: ["eleven" "twelve"] > rec[6]["f1-i32"]: [13 14] > rec[6]["f2-f64"]: [13 14] > rec[6]["string"]: ["eleven" "twelve"] > rec[7]["f1-i32"]: [15 16] > rec[7]["f2-f64"]: [15 16] > rec[7]["string"]: ["eleven" "twelve"] > rec[8]["f1-i32"]: [17 18] > rec[8]["f2-f64"]: [17 18] > rec[8]["string"]: ["eleven" "twelve"] > rec[9]["f1-i32"]: [19 20] > rec[9]["f2-f64"]: [19 20] > rec[9]["string"]: ["eleven" "twelve"] > > {code} > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5268) [GLib] Add GArrowJSONReader
[ https://issues.apache.org/jira/browse/ARROW-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-5268. - Resolution: Fixed Issue resolved by pull request 4263 [https://github.com/apache/arrow/pull/4263] > [GLib] Add GArrowJSONReader > --- > > Key: ARROW-5268 > URL: https://issues.apache.org/jira/browse/ARROW-5268 > Project: Apache Arrow > Issue Type: New Feature > Components: GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 4h > Remaining Estimate: 0h > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5314) [Go] Incorrect Printing for String Arrays with Offsets
James Walker created ARROW-5314: --- Summary: [Go] Incorrect Printing for String Arrays with Offsets Key: ARROW-5314 URL: https://issues.apache.org/jira/browse/ARROW-5314 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: James Walker

If an additional string field is added to the Table Example ([https://github.com/apache/arrow/blob/master/go/arrow/example_test.go#L495-L546)] the Table Reader outputs unexpected results.

Modified Table example:
{code:java}
pool := memory.NewGoAllocator()

schema := arrow.NewSchema(
	[]arrow.Field{
		arrow.Field{Name: "f1-i32", Type: arrow.PrimitiveTypes.Int32},
		arrow.Field{Name: "f2-f64", Type: arrow.PrimitiveTypes.Float64},
		arrow.Field{Name: "string", Type: arrow.BinaryTypes.String},
	},
	nil,
)

b := array.NewRecordBuilder(pool, schema)
defer b.Release()

b.Field(0).(*array.Int32Builder).AppendValues([]int32{1, 2, 3, 4, 5, 6}, nil)
b.Field(0).(*array.Int32Builder).AppendValues([]int32{7, 8, 9, 10}, []bool{true, true, false, true})
b.Field(1).(*array.Float64Builder).AppendValues([]float64{1, 2, 3, 4, 5, 6, 7, 8, 9, 10}, nil)
b.Field(2).(*array.StringBuilder).AppendValues([]string{
	"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
}, nil)

rec1 := b.NewRecord()
defer rec1.Release()

b.Field(0).(*array.Int32Builder).AppendValues([]int32{11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, nil)
b.Field(1).(*array.Float64Builder).AppendValues([]float64{11, 12, 13, 14, 15, 16, 17, 18, 19, 20}, nil)
b.Field(2).(*array.StringBuilder).AppendValues([]string{
	"eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen", "twenty",
}, nil)

rec2 := b.NewRecord()
defer rec2.Release()

tbl := array.NewTableFromRecords(schema, []array.Record{rec1, rec2})
defer tbl.Release()

tr := array.NewTableReader(tbl, 2)
defer tr.Release()

n := 0
for tr.Next() {
	rec := tr.Record()
	for i, col := range rec.Columns() {
		fmt.Printf("rec[%d][%q]: %v\n", n, rec.ColumnName(i), col)
	}
	n++
}
{code}

output: 
{code:java}
rec[0]["f1-i32"]: [1 2]
rec[0]["f2-f64"]: [1 2]
rec[0]["string"]: ["one" "two"]
rec[1]["f1-i32"]: [3 4]
rec[1]["f2-f64"]: [3 4]
rec[1]["string"]: ["one" "two"]
rec[2]["f1-i32"]: [5 6]
rec[2]["f2-f64"]: [5 6]
rec[2]["string"]: ["one" "two"]
rec[3]["f1-i32"]: [7 8]
rec[3]["f2-f64"]: [7 8]
rec[3]["string"]: ["one" "two"]
rec[4]["f1-i32"]: [(null) 10]
rec[4]["f2-f64"]: [9 10]
rec[4]["string"]: ["one" "two"]
rec[5]["f1-i32"]: [11 12]
rec[5]["f2-f64"]: [11 12]
rec[5]["string"]: ["eleven" "twelve"]
rec[6]["f1-i32"]: [13 14]
rec[6]["f2-f64"]: [13 14]
rec[6]["string"]: ["eleven" "twelve"]
rec[7]["f1-i32"]: [15 16]
rec[7]["f2-f64"]: [15 16]
rec[7]["string"]: ["eleven" "twelve"]
rec[8]["f1-i32"]: [17 18]
rec[8]["f2-f64"]: [17 18]
rec[8]["string"]: ["eleven" "twelve"]
rec[9]["f1-i32"]: [19 20]
rec[9]["f2-f64"]: [19 20]
rec[9]["string"]: ["eleven" "twelve"]
{code}

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
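The output above shows every chunk of the string column printing the backing array's first two values, which is consistent with the printer reading variable-length offsets from position zero instead of from the record's slice offset. A minimal Java illustration of the offsets-plus-data layout and of the off-by-slice bug; the class and method names here are illustrative, not Arrow's actual Go implementation:

```java
import java.nio.charset.StandardCharsets;

// Illustration of the offsets+data layout used by Arrow variable-length
// string arrays, and why a sliced view must apply its own offset.
public class StringSlice {
    final int[] offsets;   // offsets.length == numValues + 1
    final byte[] data;     // concatenated UTF-8 values
    final int sliceOffset; // logical start of this view in the backing array

    StringSlice(int[] offsets, byte[] data, int sliceOffset) {
        this.offsets = offsets;
        this.data = data;
        this.sliceOffset = sliceOffset;
    }

    // Correct: index into the offsets buffer relative to the slice.
    String value(int i) {
        int j = sliceOffset + i;
        return new String(data, offsets[j], offsets[j + 1] - offsets[j],
                StandardCharsets.UTF_8);
    }

    // Buggy variant matching the output above: the slice offset is dropped,
    // so every view returns the first values of the backing array.
    String valueIgnoringOffset(int i) {
        return new String(data, offsets[i], offsets[i + 1] - offsets[i],
                StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = "onetwothreefour".getBytes(StandardCharsets.UTF_8);
        int[] offsets = {0, 3, 6, 11, 15};
        StringSlice chunk = new StringSlice(offsets, data, 2); // view over {"three", "four"}
        System.out.println(chunk.value(0));                // three
        System.out.println(chunk.valueIgnoringOffset(0));  // one
    }
}
```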
[jira] [Updated] (ARROW-5313) [Format] Comments on Field table are a bit confusing
[ https://issues.apache.org/jira/browse/ARROW-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5313: -- Labels: pull-request-available (was: ) > [Format] Comments on Field table are a bit confusing > > > Key: ARROW-5313 > URL: https://issues.apache.org/jira/browse/ARROW-5313 > Project: Apache Arrow > Issue Type: Task > Components: Format >Affects Versions: 0.13.0 >Reporter: Brian Hulette >Assignee: Brian Hulette >Priority: Major > Labels: pull-request-available > > Currently Schema.fbs has two different explanations of {{Field.children}} > One says "children is only for nested Arrow arrays" and the other says > "children apply only to nested data types like Struct, List and Union". I > think both are technically correct but the latter is much more explicit, we > should remove the former. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5313) [Format] Comments on Field table are a bit confusing
Brian Hulette created ARROW-5313: Summary: [Format] Comments on Field table are a bit confusing Key: ARROW-5313 URL: https://issues.apache.org/jira/browse/ARROW-5313 Project: Apache Arrow Issue Type: Task Components: Format Affects Versions: 0.13.0 Reporter: Brian Hulette Assignee: Brian Hulette Currently Schema.fbs has two different explanations of {{Field.children}}. One says "children is only for nested Arrow arrays" and the other says "children apply only to nested data types like Struct, List and Union". I think both are technically correct, but the latter is much more explicit; we should remove the former. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5312) [C++] Move JSON integration testing utilities to arrow/testing and libarrow_testing.so
Wes McKinney created ARROW-5312: --- Summary: [C++] Move JSON integration testing utilities to arrow/testing and libarrow_testing.so Key: ARROW-5312 URL: https://issues.apache.org/jira/browse/ARROW-5312 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Wes McKinney Fix For: 0.14.0 It's not necessary to have this code in libarrow.so. Let's tackle this after ARROW-3144 and ARROW-835. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5306) [CI] [GLib] Disable GTK-Doc
[ https://issues.apache.org/jira/browse/ARROW-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kouhei Sutou resolved ARROW-5306. - Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4299 [https://github.com/apache/arrow/pull/4299] > [CI] [GLib] Disable GTK-Doc > --- > > Key: ARROW-5306 > URL: https://issues.apache.org/jira/browse/ARROW-5306 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 10m > Remaining Estimate: 0h > > Travis fails to process documents by GTK-Doc. > [https://travis-ci.org/apache/arrow/jobs/531197944#L4170] > This is caused by the recent GTK-Doc upgrade to 0.13.0. So disable GTK-Doc until > 0.13.1 is released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector
[ https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838761#comment-16838761 ] Bryan Cutler commented on ARROW-5224: - [~tianchen92] could you encode the BigIntVector into a VarBinaryVector as LEB128 and then serialize that vector as an Arrow RecordBatch? > [Java] Add APIs for supporting directly serialize/deserialize ValueVector > - > > Key: ARROW-5224 > URL: https://issues.apache.org/jira/browse/ARROW-5224 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > There is no API to directly serialize/deserialize ValueVector. The only way > to implement this is to put a single FieldVector in VectorSchemaRoot and > convert it to ArrowRecordBatch, and the deserialize process is as well. > Provide a utility class to implement this may be better, I know all > serializations should follow IPC format so that data can be shared between > different Arrow implementations. But for users who only use Java API and want > to do some further optimization, this seem to be no problem and we could > provide them a more option. > This may take some benefits for Java user who only use ValueVector rather > than IPC series classes such as ArrowReordBatch: > * We could do some shuffle optimization such as compression and some > encoding algorithm for numerical type which could greatly improve performance. > * Do serialize/deserialize with the actual buffer size within vector since > the buffer size is power of 2 which is actually bigger than it really need. > * Reduce data conversion(VectorSchemaRoot, ArrowRecordBatch etc) to make it > user-friendly. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
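Bryan's suggestion above amounts to varint-encoding each 8-byte value so that small numbers occupy fewer bytes before IPC serialization. A minimal, dependency-free sketch of unsigned LEB128 in Java; this is illustrative only and not an Arrow API:

```java
import java.io.ByteArrayOutputStream;

// Minimal unsigned LEB128 (varint) codec, sketching the encoding suggested
// in the comment above. Illustrative only; not part of Arrow's Java API.
public class Leb128 {
    // Encode value as unsigned LEB128: 7 payload bits per byte,
    // high bit set means "more bytes follow".
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do {
            byte b = (byte) (value & 0x7f);
            value >>>= 7;
            if (value != 0) {
                b |= (byte) 0x80; // continuation bit
            }
            out.write(b);
        } while (value != 0);
        return out.toByteArray();
    }

    // Decode a single unsigned LEB128 value from the start of buf.
    static long decode(byte[] buf) {
        long result = 0;
        int shift = 0;
        for (byte b : buf) {
            result |= (long) (b & 0x7f) << shift;
            if ((b & 0x80) == 0) {
                break;
            }
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        long v = 300;
        byte[] enc = encode(v);          // small values need fewer bytes than a fixed 8-byte long
        System.out.println(enc.length);  // 2
        System.out.println(decode(enc)); // 300
    }
}
```

As the later comment in this thread notes, storing each encoded value in a VarBinaryVector adds a 4-byte offset per entry, which can cancel out the savings for small integers.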
[jira] [Resolved] (ARROW-5291) [Python] Add wrapper for "take" kernel on Array
[ https://issues.apache.org/jira/browse/ARROW-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-5291. --- Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4281 [https://github.com/apache/arrow/pull/4281] > [Python] Add wrapper for "take" kernel on Array > > > Key: ARROW-5291 > URL: https://issues.apache.org/jira/browse/ARROW-5291 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 1h 50m > Remaining Estimate: 0h > > Expose the {{take}} kernel (for primitive types, ARROW-2102) on the python > {{Array}} class. Part of ARROW-2667. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-4993) [C++] Display summary at the end of CMake configuration
[ https://issues.apache.org/jira/browse/ARROW-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-4993: -- Labels: pull-request-available (was: ) > [C++] Display summary at the end of CMake configuration > --- > > Key: ARROW-4993 > URL: https://issues.apache.org/jira/browse/ARROW-4993 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Affects Versions: 0.12.1 >Reporter: Antoine Pitrou >Priority: Minor > Labels: pull-request-available > Fix For: 0.14.0 > > > Some third-party projects like Thrift display a nice and useful summary of > the build configuration at the end of the CMake configuration run: > https://ci.appveyor.com/project/pitrou/arrow/build/job/mgi68rvk0u5jf2s4?fullLog=true#L2325 > It may be good to have a similar thing in Arrow as well. Bonus points if, for > each configuration item, it says which CMake variable can be used to > influence it. > Something like: > {code} > -- Build ZSTD support: ON [change using ARROW_WITH_ZSTD] > -- Build BZ2 support: OFF [change using ARROW_WITH_BZ2] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-1012) [C++] Create a configurable implementation of RecordBatchReader that reads from Apache Parquet files
[ https://issues.apache.org/jira/browse/ARROW-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-1012: -- Labels: parquet pull-request-available (was: parquet) > [C++] Create a configurable implementation of RecordBatchReader that reads > from Apache Parquet files > > > Key: ARROW-1012 > URL: https://issues.apache.org/jira/browse/ARROW-1012 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Hatem Helal >Priority: Major > Labels: parquet, pull-request-available > Fix For: 0.14.0 > > > This will be enabled by -ARROW-1008.- > A preliminary implementation of an {{arrow::RecordBatchReader}} was added in > PARQUET-1166 but does not support configuring the batch size. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5272) [C++] [Gandiva] JIT code executed over uninitialized values
[ https://issues.apache.org/jira/browse/ARROW-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5272: -- Labels: pull-request-available (was: ) > [C++] [Gandiva] JIT code executed over uninitialized values > --- > > Key: ARROW-5272 > URL: https://issues.apache.org/jira/browse/ARROW-5272 > Project: Apache Arrow > Issue Type: Bug > Components: C++ - Gandiva >Reporter: Antoine Pitrou >Assignee: Pindikura Ravindra >Priority: Major > Labels: pull-request-available > > When running Gandiva tests with Valgrind, I get the following errors: > {code} > [==] Running 4 tests from 1 test case. > [--] Global test environment set-up. > [--] 4 tests from TestDecimal > [ RUN ] TestDecimal.TestSimple > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x41110D5: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x41110E8: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x44B: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > ==12052== Conditional jump or move depends on uninitialised value(s) > ==12052==at 0x47B: ??? > ==12052== > { > >Memcheck:Cond >obj:* > } > [ OK ] TestDecimal.TestSimple (16625 ms) > [ RUN ] TestDecimal.TestLiteral > [ OK ] TestDecimal.TestLiteral (3480 ms) > [ RUN ] TestDecimal.TestIfElse > [ OK ] TestDecimal.TestIfElse (2408 ms) > [ RUN ] TestDecimal.TestCompare > [ OK ] TestDecimal.TestCompare (5303 ms) > {code} > I think this is legitimate. Gandiva runs computations over all values, even > when the bitmap indicates a null value. But decimal computations are complex > and involve conditional jumps, hence the error ("Conditional jump or move > depends on uninitialised value(s)"). > [~pravindra] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector
[ https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838665#comment-16838665 ] Micah Kornfield commented on ARROW-5224: For #1, this seems fairly application specific, so I think it would be best to either agree there is interest in supporting this across languages or have it in a separate library. But others on the mailing list might have different opinions. Also, do you have benchmarks showing that encoding improves performance on your system? At least in some cases throughput declines and latency goes up due to the extra serialization and deserialization cost on each side of the wire. Lastly, for compression you should be able to get decent results by using a WriteableByteChannel that compresses things on the way out (e.g. https://github.com/xerial/snappy-java/blob/master/src/main/java/org/xerial/snappy/SnappyFramedOutputStream.java) > [Java] Add APIs for supporting directly serialize/deserialize ValueVector > - > > Key: ARROW-5224 > URL: https://issues.apache.org/jira/browse/ARROW-5224 > Project: Apache Arrow > Issue Type: Improvement >Reporter: Ji Liu >Assignee: Ji Liu >Priority: Minor > Labels: pull-request-available > Time Spent: 2.5h > Remaining Estimate: 0h > > There is no API to directly serialize/deserialize ValueVector. The only way > to implement this is to put a single FieldVector in VectorSchemaRoot and > convert it to ArrowRecordBatch, and the deserialize process is as well. > Provide a utility class to implement this may be better, I know all > serializations should follow IPC format so that data can be shared between > different Arrow implementations. But for users who only use Java API and want > to do some further optimization, this seem to be no problem and we could > provide them a more option. 
> This may take some benefits for Java user who only use ValueVector rather > than IPC series classes such as ArrowReordBatch: > * We could do some shuffle optimization such as compression and some > encoding algorithm for numerical type which could greatly improve performance. > * Do serialize/deserialize with the actual buffer size within vector since > the buffer size is power of 2 which is actually bigger than it really need. > * Reduce data conversion(VectorSchemaRoot, ArrowRecordBatch etc) to make it > user-friendly. > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
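The "blackbox" compression option mentioned in the comment above keeps the IPC format untouched and simply wraps the output channel in a compressing stream. The sketch below uses the JDK's GZIPOutputStream as a stand-in for the Snappy framed stream linked there, an assumption made to keep the example dependency-free:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Sketch of compressing serialized bytes on the way out without changing the
// Arrow format. GZIP stands in for Snappy here to avoid a third-party dependency.
public class WireCompression {
    static byte[] compress(byte[] payload) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(sink)) {
            gz.write(payload); // in real use, the IPC writer would write here
        }
        return sink.toByteArray();
    }

    static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gz.readAllBytes();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[4096]; // e.g. a serialized record batch; zeros compress well
        byte[] compressed = compress(payload);
        System.out.println(compressed.length < payload.length); // true
        System.out.println(java.util.Arrays.equals(payload, decompress(compressed))); // true
    }
}
```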
[jira] [Commented] (ARROW-2981) [C++] Support scripts / documentation for running clang-tidy on codebase
[ https://issues.apache.org/jira/browse/ARROW-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838624#comment-16838624 ] Uwe L. Korn commented on ARROW-2981: [~bkietz] This is the intended behaviour. We also have a check-format command in CMake, but it is not yet exposed via docker-compose. > [C++] Support scripts / documentation for running clang-tidy on codebase > > > Key: ARROW-2981 > URL: https://issues.apache.org/jira/browse/ARROW-2981 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Related to ARROW-2952, ARROW-2980 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-2981) [C++] Support scripts / documentation for running clang-tidy on codebase
[ https://issues.apache.org/jira/browse/ARROW-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838613#comment-16838613 ] Benjamin Kietzman commented on ARROW-2981: -- [~wesmckinn] [~fsaintjacques] Currently, `docker-compose run format` modifies source in place. Is this the intended behavior for that service, and is that the behavior we want for clang-tidy? Alternatively, do we just want to emit warnings/errors and leave the source unmodified? > [C++] Support scripts / documentation for running clang-tidy on codebase > > > Key: ARROW-2981 > URL: https://issues.apache.org/jira/browse/ARROW-2981 > Project: Apache Arrow > Issue Type: Improvement > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 40m > Remaining Estimate: 0h > > Related to ARROW-2952, ARROW-2980 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5311) [C++] Return more specific invalid Status in Take kernel
Joris Van den Bossche created ARROW-5311: Summary: [C++] Return more specific invalid Status in Take kernel Key: ARROW-5311 URL: https://issues.apache.org/jira/browse/ARROW-5311 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Joris Van den Bossche Fix For: 0.14.0 Currently the {{Take}} kernel returns a generic Invalid Status for certain cases that could use a more specific error: - indices of wrong type (eg floats) -> TypeError instead of Invalid? - out of bounds index -> new IndexError ? From review in https://github.com/apache/arrow/pull/4281 cc [~bkietz] -- This message was sent by Atlassian JIRA (v7.6.3#76005)
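The error mapping proposed above can be sketched at the Python level. This is a hypothetical pure-Python model of the behaviour being requested from the C++ kernel, not the actual pyarrow implementation: non-integer indices raise TypeError and out-of-bounds indices raise IndexError, instead of one generic "Invalid" status.

```python
def take(values, indices):
    # Validate indices first, mirroring the proposed kernel behaviour.
    for i in indices:
        if isinstance(i, float):
            raise TypeError("Take: indices must be of integer type")
        if not 0 <= i < len(values):
            raise IndexError(f"Take: index {i} out of bounds")
    return [values[i] for i in indices]


assert take([10, 20, 30], [2, 0]) == [30, 10]
```

Distinct exception classes let callers handle "wrong index type" and "bad index value" separately, which a single Invalid status does not allow.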
[jira] [Assigned] (ARROW-1280) [C++] Implement Fixed Size List type
[ https://issues.apache.org/jira/browse/ARROW-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou reassigned ARROW-1280: - Assignee: Benjamin Kietzman > [C++] Implement Fixed Size List type > > > Key: ARROW-1280 > URL: https://issues.apache.org/jira/browse/ARROW-1280 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Assignee: Benjamin Kietzman >Priority: Major > Labels: beginner, pull-request-available > Fix For: 0.14.0 > > Time Spent: 5h 40m > Remaining Estimate: 0h > > At the moment, we only support lists with a variable size per entry. In some > cases, each entry of a list column will have the same number of elements. In > this case, we can use a more effective data structure as well as do certain > optimisations on the operations of this type. To implement this type: > * Describe the memory structure of it in Layout.md > * Add the type to the enums in the C++ code > * Add FixedSizeListArray, FixedSizeListType and FixedSizeListBuilder classes > to the C++ library -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-1280) [C++] Implement Fixed Size List type
[ https://issues.apache.org/jira/browse/ARROW-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-1280. --- Resolution: Fixed Issue resolved by pull request 4278 [https://github.com/apache/arrow/pull/4278] > [C++] Implement Fixed Size List type > > > Key: ARROW-1280 > URL: https://issues.apache.org/jira/browse/ARROW-1280 > Project: Apache Arrow > Issue Type: New Feature > Components: C++ >Reporter: Wes McKinney >Priority: Major > Labels: beginner, pull-request-available > Fix For: 0.14.0 > > Time Spent: 5.5h > Remaining Estimate: 0h > > At the moment, we only support lists with a variable size per entry. In some > cases, each entry of a list column will have the same number of elements. In > this case, we can use a more effective data structure as well as do certain > optimisations on the operations of this type. To implement this type: > * Describe the memory structure of it in Layout.md > * Add the type to the enums in the C++ code > * Add FixedSizeListArray, FixedSizeListType and FixedSizeListBuilder classes > to the C++ library -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-4516) [Python] Error while creating a ParquetDataset on a path without `_common_dataset` but with an empty `_tempfile`
[ https://issues.apache.org/jira/browse/ARROW-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838604#comment-16838604 ] Joris Van den Bossche commented on ARROW-4516: -- Similarly to ARROW-1079 / https://github.com/apache/arrow/pull/860 (which filtered out _directories_ that started with an underscore), we might also want to exclude all "private" files, except for the commonly recognised ones such as {{_metadata}} and {{_common_metadata}}. > [Python] Error while creating a ParquetDataset on a path without > `_common_dataset` but with an empty `_tempfile` > > > Key: ARROW-4516 > URL: https://issues.apache.org/jira/browse/ARROW-4516 > Project: Apache Arrow > Issue Type: Bug > Components: Python >Affects Versions: 0.12.0 >Reporter: yogesh garg >Priority: Major > Labels: parquet > Fix For: 0.14.0 > > > I suspect that there's an error in this line of code: > https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L926 > While validating schema in the initialisation of a {{ParquetDataset}}, we > assume that if the {{_common_metadata}} file does not exist, the schema should be > inferred from the first piece of that dataset. The first piece, in my > experience, could refer to a file named with an underscore that does not > necessarily have to contain the schema, and could be an empty file, e.g. > {{_tempfile}}. > {code:bash} > /tmp/pq/ > ├── part1.parquet > └── _tempfile > {code} > This behavior is allowed by the parquet specification, and we should probably > ignore such pieces. > On a cursory look, we could do either of the following. > 1. Choose the first piece with a path that does not start with "_" > 2. Sort pieces by name, but put all the "_" pieces later while making the > manifest. > https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L729 > 3. 
Silently exclude all the files starting with "_" here, but this will need > to be tested: > https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L770 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
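The filtering discussed above (options 1 and 3, refined by the comment to keep the recognised sidecar files) can be sketched in a few lines. The function name and the exact set of recognised names are assumptions for illustration, not pyarrow's actual code.

```python
import os

# Sidecar files that should survive the "private file" filter.
RECOGNISED = {"_metadata", "_common_metadata"}


def filter_pieces(paths):
    """Drop files whose basename starts with '_', except recognised sidecars."""
    kept = []
    for path in paths:
        name = os.path.basename(path)
        if name.startswith("_") and name not in RECOGNISED:
            continue  # e.g. an empty _tempfile left behind by a writer
        kept.append(path)
    return kept
```

With the layout from the issue, `filter_pieces(["/tmp/pq/part1.parquet", "/tmp/pq/_tempfile"])` would keep only the real parquet piece, so schema inference never lands on the empty `_tempfile`.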
[jira] [Updated] (ARROW-5293) [C++] Take kernel on DictionaryArray does not preserve ordered flag
[ https://issues.apache.org/jira/browse/ARROW-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joris Van den Bossche updated ARROW-5293: - Fix Version/s: 0.14.0 > [C++] Take kernel on DictionaryArray does not preserve ordered flag > --- > > Key: ARROW-5293 > URL: https://issues.apache.org/jira/browse/ARROW-5293 > Project: Apache Arrow > Issue Type: Bug > Components: C++ >Reporter: Joris Van den Bossche >Priority: Major > Fix For: 0.14.0 > > > In the Python tests I was adding, this was failing for an ordered > DictionaryArray: > https://github.com/apache/arrow/pull/4281/commits/1f65936e1a06ae415647af7d5c7f54c5937861f6#diff-01b63f189a63c0d4016f2f91370e08fcR92 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5310) [Python] better error message on creating ParquetDataset from empty directory
Joris Van den Bossche created ARROW-5310: Summary: [Python] better error message on creating ParquetDataset from empty directory Key: ARROW-5310 URL: https://issues.apache.org/jira/browse/ARROW-5310 Project: Apache Arrow Issue Type: Bug Components: Python Reporter: Joris Van den Bossche Currently, this is what you get when {{path}} is an existing but empty directory: {code:python} >>> dataset = pq.ParquetDataset(path) --- IndexError Traceback (most recent call last) in > 1 dataset = pq.ParquetDataset(path) ~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, path_or_paths, filesystem, schema, metadata, split_row_groups, validate_schema, filters, metadata_nthreads, memory_map) 989 990 if validate_schema: --> 991 self.validate_schemas() 992 993 if filters is not None: ~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self) 1025 self.schema = self.common_metadata.schema 1026 else: -> 1027 self.schema = self.pieces[0].get_metadata().schema 1028 elif self.schema is None: 1029 self.schema = self.metadata.schema IndexError: list index out of range {code} That could be a nicer error message. Unless we actually want to allow this? (although I am not sure there are good use cases of empty directories to support this, because from an empty directory we cannot get any schema or metadata information?) It is only failing when validating the schemas, so with {{validate_schema=False}} it actually returns a ParquetDataset object, just with an empty list for {{pieces}} and no schema. So it would also be easy to not error when validating the schemas for this empty-directory case. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
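The "nicer error message" option from the issue can be sketched as an explicit guard before indexing into the pieces list, instead of letting the IndexError escape from {{self.pieces[0]}}. This is a simplified stand-in for the real {{validate_schemas}} logic; the signature and return value are hypothetical.

```python
def validate_schemas(pieces, common_metadata=None, schema=None):
    # If a schema was supplied or can come from _common_metadata, use it.
    if schema is not None or common_metadata is not None:
        return schema if schema is not None else common_metadata
    # Guard the empty-directory case with an explicit, descriptive error
    # rather than an IndexError from pieces[0].
    if not pieces:
        raise ValueError(
            "ParquetDataset contains no pieces; cannot infer a schema "
            "from an empty directory")
    return pieces[0]  # stand-in for pieces[0].get_metadata().schema
```

The alternative mentioned in the issue, simply skipping validation when there are no pieces, would be the same check with a `return None` instead of the `raise`.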
[jira] [Commented] (ARROW-2572) [Python] Add factory function to create a Table from Columns and Schema.
[ https://issues.apache.org/jira/browse/ARROW-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838562#comment-16838562 ] Antoine Pitrou commented on ARROW-2572: --- [~jorisvandenbossche] > [Python] Add factory function to create a Table from Columns and Schema. > > > Key: ARROW-2572 > URL: https://issues.apache.org/jira/browse/ARROW-2572 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Affects Versions: 0.9.0 >Reporter: Thomas Buhrmann >Priority: Minor > Labels: beginner > Fix For: 0.14.0 > > > At the moment it seems to be impossible in Python to add custom metadata to a > Table or Column. The closest I've come is to create a list of new Fields (by > "appending" metadata to existing Fields), and then creating a new Schema from > these Fields using the Schema factory function. But I can't see how to create > a new table from the existing Columns and my new Schema, which I understand > would be the way to do it in C++? > Essentially, wrappers for the Table's Make(...) functions seem to be missing. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files
[ https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838559#comment-16838559 ] Wes McKinney commented on ARROW-3424: - Yes, that might work. I think we should hold off until we can migrate this logic into C++, though > [Python] Improved workflow for loading an arbitrary collection of Parquet > files > --- > > Key: ARROW-3424 > URL: https://issues.apache.org/jira/browse/ARROW-3424 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.14.0 > > > See SO question for use case: > https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema
[ https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Antoine Pitrou resolved ARROW-5286. --- Resolution: Fixed Issue resolved by pull request 4297 [https://github.com/apache/arrow/pull/4297] > [Python] support Structs in Table.from_pandas given a known schema > -- > > Key: ARROW-5286 > URL: https://issues.apache.org/jira/browse/ARROW-5286 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > ARROW-2073 implemented creating a StructArray from an array of tuples (in > addition to from dicts). > This works in {{pyarrow.array}} (specifying the proper type): > {code} > In [2]: df = pd.DataFrame({'tuples': [(1, 2), (3, 4)]}) > > > In [3]: struct_type = pa.struct([('a', pa.int64()), ('b', pa.int64())]) > > > In [4]: pa.array(df['tuples'], type=struct_type) > > > Out[4]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 3 > ] > -- child 1 type: int64 > [ > 2, > 4 > ] > {code} > But does not yet work when converting a DataFrame to Table while specifying > the type in a schema: > {code} > In [5]: pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)])) > > > --- > KeyError Traceback (most recent call last) > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > get_logical_type(arrow_type) > 68 try: > ---> 69 return logical_type_map[arrow_type.id] > 70 except KeyError: > KeyError: 24 > During handling of the above exception, another exception occurred: > NotImplementedError Traceback (most recent call last) > in > > 1 pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)])) > ~/scipy/repos/arrow/python/pyarrow/table.pxi in > pyarrow.lib.Table.from_pandas() > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > dataframe_to_arrays(df, schema, preserve_index, 
nthreads, columns, safe) > 483 metadata = construct_metadata(df, column_names, index_columns, > 484 index_descriptors, preserve_index, > --> 485 types) > 486 return all_names, arrays, metadata > 487 > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, > column_names, index_levels, index_descriptors, preserve_index, types) > 207 metadata = get_column_metadata(df[col_name], > name=sanitized_name, > 208arrow_type=arrow_type, > --> 209field_name=sanitized_name) > 210 column_metadata.append(metadata) > 211 > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > get_column_metadata(column, name, arrow_type, field_name) > 149 dict > 150 """ > --> 151 logical_type = get_logical_type(arrow_type) > 152 > 153 string_dtype, extra_metadata = get_extension_dtype_info(column) > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > get_logical_type(arrow_type) > 77 elif isinstance(arrow_type, pa.lib.Decimal128Type): > 78 return 'decimal' > ---> 79 raise NotImplementedError(str(arrow_type)) > 80 > 81 > NotImplementedError: struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5290) [Java] Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney updated ARROW-5290: Summary: [Java] Provide a flag to enable/disable null-checking in vectors' get methods (was: Provide a flag to enable/disable null-checking in vectors' get methods) > [Java] Provide a flag to enable/disable null-checking in vectors' get methods > - > > Key: ARROW-5290 > URL: https://issues.apache.org/jira/browse/ARROW-5290 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > For vector classes, the get method first checks if the value at the given > index is null. If it is not null, the method goes ahead to retrieve the > value. > For some scenarios, the first check is redundant, because the application > code has already checked the null, before calling the get method. This > redundant check may have non-trivial performance overheads. > So we add a flag to enable/disable the null checking, so the user can set the > flag according to their own specific scenario. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wes McKinney resolved ARROW-5290. - Resolution: Fixed Fix Version/s: 0.14.0 Issue resolved by pull request 4288 [https://github.com/apache/arrow/pull/4288] > Provide a flag to enable/disable null-checking in vectors' get methods > -- > > Key: ARROW-5290 > URL: https://issues.apache.org/jira/browse/ARROW-5290 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > Time Spent: 3h 20m > Remaining Estimate: 0h > > For vector classes, the get method first checks if the value at the given > index is null. If it is not null, the method goes ahead to retrieve the > value. > For some scenarios, the first check is redundant, because the application > code has already checked the null, before calling the get method. This > redundant check may have non-trivial performance overheads. > So we add a flag to enable/disable the null checking, so the user can set the > flag according to their own specific scenario. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
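The flag described in ARROW-5290 can be modelled in a few lines of Python. This is only an illustrative sketch of the idea, not the Java implementation; the flag and class names here are hypothetical.

```python
# Global switch mirroring the Java flag: when the application has already
# checked for nulls, it can disable the redundant per-get check.
NULL_CHECKING_ENABLED = True


class IntVector:
    def __init__(self, values, validity):
        self.values = values      # raw slot values
        self.validity = validity  # True where the slot is non-null

    def get(self, index):
        # The null check is skipped entirely when the flag is off,
        # removing the branch from the hot path.
        if NULL_CHECKING_ENABLED and not self.validity[index]:
            raise ValueError(f"Value at index {index} is null")
        return self.values[index]
```

With the flag on, `get` on a null slot raises; with it off, the caller takes responsibility for checking validity first, trading safety for one fewer branch per access.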
[jira] [Created] (ARROW-5309) [Python] Add clarifications to Python "append" methods that return new objects
Wes McKinney created ARROW-5309: --- Summary: [Python] Add clarifications to Python "append" methods that return new objects Key: ARROW-5309 URL: https://issues.apache.org/jira/browse/ARROW-5309 Project: Apache Arrow Issue Type: Improvement Components: Python Reporter: Wes McKinney Fix For: 0.14.0 The current docstrings do say that an object is returned, but it is not clear in all cases that it is a new object and that the original object is left unmodified; see the example thread https://github.com/apache/arrow/issues/4296 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5308) [Go] remove deprecated Feather format
[ https://issues.apache.org/jira/browse/ARROW-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5308: -- Labels: pull-request-available (was: ) > [Go] remove deprecated Feather format > - > > Key: ARROW-5308 > URL: https://issues.apache.org/jira/browse/ARROW-5308 > Project: Apache Arrow > Issue Type: Bug > Components: Go >Reporter: Sebastien Binet >Priority: Major > Labels: pull-request-available > > we should probably consider removing the feather format files from the Go > backend. > Feather is deprecated and right now the Go implementation is just the result > of the automatically generated code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5308) [Go] remove deprecated Feather format
Sebastien Binet created ARROW-5308: -- Summary: [Go] remove deprecated Feather format Key: ARROW-5308 URL: https://issues.apache.org/jira/browse/ARROW-5308 Project: Apache Arrow Issue Type: Bug Components: Go Reporter: Sebastien Binet we should probably consider removing the feather format files from the Go backend. Feather is deprecated and right now the Go implementation is just the result of the automatically generated code. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5213) [Format] Script for updating various checked-in Flatbuffers files
[ https://issues.apache.org/jira/browse/ARROW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838516#comment-16838516 ] Sebastien Binet commented on ARROW-5213: FYI, re-generating the Go files is "as simple as": {{$> cd go/arrow}} {{$> go run ./gen-flatbuffers.go}} (but one needs to have a Go SDK available.) > [Format] Script for updating various checked-in Flatbuffers files > - > > Key: ARROW-5213 > URL: https://issues.apache.org/jira/browse/ARROW-5213 > Project: Apache Arrow > Issue Type: Improvement > Components: Format, Go, Rust >Reporter: Wes McKinney >Assignee: Andy Grove >Priority: Major > > Some subprojects have begun checking in generated Flatbuffers files to source > control. This presents a maintainability issue when there are additions or > changes made to the .fbs sources. It would be useful to be able to automate > the update of these files so it doesn't have to happen on a manual / > case-by-case basis -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5306) [CI] [GLib] Disable GTK-Doc
[ https://issues.apache.org/jira/browse/ARROW-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5306: -- Labels: pull-request-available (was: ) > [CI] [GLib] Disable GTK-Doc > --- > > Key: ARROW-5306 > URL: https://issues.apache.org/jira/browse/ARROW-5306 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > Labels: pull-request-available > > Travis fails to process documents with GTK-Doc. > [https://travis-ci.org/apache/arrow/jobs/531197944#L4170] > This is caused by the recent GTK-Doc upgrade to 0.13.0, so disable GTK-Doc until > 0.13.1 is released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5306) [CI] [GLib] Disable GTK-Doc
[ https://issues.apache.org/jira/browse/ARROW-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yosuke Shiro updated ARROW-5306: Issue Type: Bug (was: New Feature) > [CI] [GLib] Disable GTK-Doc > --- > > Key: ARROW-5306 > URL: https://issues.apache.org/jira/browse/ARROW-5306 > Project: Apache Arrow > Issue Type: Bug > Components: Continuous Integration, GLib >Reporter: Yosuke Shiro >Assignee: Yosuke Shiro >Priority: Major > > Travis fails to process documents with GTK-Doc. > [https://travis-ci.org/apache/arrow/jobs/531197944#L4170] > This is caused by the recent GTK-Doc upgrade to 0.13.0, so disable GTK-Doc until > 0.13.1 is released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5307) [CI] [GLib] Enable GTK-Doc
[ https://issues.apache.org/jira/browse/ARROW-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yosuke Shiro updated ARROW-5307: Issue Type: Improvement (was: New Feature) > [CI] [GLib] Enable GTK-Doc > -- > > Key: ARROW-5307 > URL: https://issues.apache.org/jira/browse/ARROW-5307 > Project: Apache Arrow > Issue Type: Improvement > Components: Continuous Integration, GLib >Reporter: Yosuke Shiro >Priority: Major > > Enable GTK-Doc when 0.13.1 is released. > See https://issues.apache.org/jira/browse/ARROW-5306. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5307) [CI] [GLib] Enable GTK-Doc
[ https://issues.apache.org/jira/browse/ARROW-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yosuke Shiro updated ARROW-5307: Description: Enable GTK-Doc when 0.13.1 is released. See https://issues.apache.org/jira/browse/ARROW-5306. > [CI] [GLib] Enable GTK-Doc > -- > > Key: ARROW-5307 > URL: https://issues.apache.org/jira/browse/ARROW-5307 > Project: Apache Arrow > Issue Type: New Feature > Components: Continuous Integration, GLib >Reporter: Yosuke Shiro >Priority: Major > > Enable GTK-Doc when 0.13.1 is released. > See https://issues.apache.org/jira/browse/ARROW-5306. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5307) [CI] [GLib] Enable GTK-Doc
Yosuke Shiro created ARROW-5307: --- Summary: [CI] [GLib] Enable GTK-Doc Key: ARROW-5307 URL: https://issues.apache.org/jira/browse/ARROW-5307 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration, GLib Reporter: Yosuke Shiro -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (ARROW-5306) [CI] [GLib] Disable GTK-Doc
Yosuke Shiro created ARROW-5306: --- Summary: [CI] [GLib] Disable GTK-Doc Key: ARROW-5306 URL: https://issues.apache.org/jira/browse/ARROW-5306 Project: Apache Arrow Issue Type: New Feature Components: Continuous Integration, GLib Reporter: Yosuke Shiro Assignee: Yosuke Shiro Travis fails to process documents with GTK-Doc. [https://travis-ci.org/apache/arrow/jobs/531197944#L4170] This is caused by the recent GTK-Doc upgrade to 0.13.0, so disable GTK-Doc until 0.13.1 is released. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liya Fan updated ARROW-5290: Attachment: (was: safe.png) > Provide a flag to enable/disable null-checking in vectors' get methods > -- > > Key: ARROW-5290 > URL: https://issues.apache.org/jira/browse/ARROW-5290 > Project: Apache Arrow > Issue Type: New Feature > Components: Java >Reporter: Liya Fan >Assignee: Liya Fan >Priority: Major > Labels: pull-request-available > Time Spent: 3h 20m > Remaining Estimate: 0h > > For vector classes, the get method first checks if the value at the given > index is null. If it is not null, the method goes ahead to retrieve the > value. > For some scenarios, the first check is redundant, because the application > code has already checked the null, before calling the get method. This > redundant check may have non-trivial performance overheads. > So we add a flag to enable/disable the null checking, so the user can set the > flag according to their own specific scenario. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files
[ https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838442#comment-16838442 ] Joris Van den Bossche commented on ARROW-3424: -- Currently, a list of files is already supported in {{ParquetDataset}}. So something like this (that would address the SO question, I think) works: {code:java} dataset = pq.ParquetDataset(['part0.parquet', 'part1.parquet']) dataset.read_pandas().to_pandas() {code} Do we think that is enough support? (if so, this issue can be closed I think) Or do we want to add this to {{pq.read_table}} ? (which eg also accepts a directory name, which is then passed through to {{ParquetDataset}}. We could do a similar pass through for a list of paths) > [Python] Improved workflow for loading an arbitrary collection of Parquet > files > --- > > Key: ARROW-3424 > URL: https://issues.apache.org/jira/browse/ARROW-3424 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Wes McKinney >Priority: Major > Labels: parquet > Fix For: 0.14.0 > > > See SO question for use case: > https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema
[ https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated ARROW-5286: -- Labels: pull-request-available (was: ) > [Python] support Structs in Table.from_pandas given a known schema > -- > > Key: ARROW-5286 > URL: https://issues.apache.org/jira/browse/ARROW-5286 > Project: Apache Arrow > Issue Type: Improvement > Components: Python >Reporter: Joris Van den Bossche >Assignee: Joris Van den Bossche >Priority: Major > Labels: pull-request-available > Fix For: 0.14.0 > > > ARROW-2073 implemented creating a StructArray from an array of tuples (in > addition to from dicts). > This works in {{pyarrow.array}} (specifying the proper type): > {code} > In [2]: df = pd.DataFrame({'tuples': [(1, 2), (3, 4)]}) > > > In [3]: struct_type = pa.struct([('a', pa.int64()), ('b', pa.int64())]) > > > In [4]: pa.array(df['tuples'], type=struct_type) > > > Out[4]: > > -- is_valid: all not null > -- child 0 type: int64 > [ > 1, > 3 > ] > -- child 1 type: int64 > [ > 2, > 4 > ] > {code} > But does not yet work when converting a DataFrame to Table while specifying > the type in a schema: > {code} > In [5]: pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)])) > > > --- > KeyError Traceback (most recent call last) > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > get_logical_type(arrow_type) > 68 try: > ---> 69 return logical_type_map[arrow_type.id] > 70 except KeyError: > KeyError: 24 > During handling of the above exception, another exception occurred: > NotImplementedError Traceback (most recent call last) > in > > 1 pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)])) > ~/scipy/repos/arrow/python/pyarrow/table.pxi in > pyarrow.lib.Table.from_pandas() > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe) > 483 metadata = construct_metadata(df, column_names, index_columns, > 484 
index_descriptors, preserve_index, > --> 485 types) > 486 return all_names, arrays, metadata > 487 > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, > column_names, index_levels, index_descriptors, preserve_index, types) > 207 metadata = get_column_metadata(df[col_name], > name=sanitized_name, > 208arrow_type=arrow_type, > --> 209field_name=sanitized_name) > 210 column_metadata.append(metadata) > 211 > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > get_column_metadata(column, name, arrow_type, field_name) > 149 dict > 150 """ > --> 151 logical_type = get_logical_type(arrow_type) > 152 > 153 string_dtype, extra_metadata = get_extension_dtype_info(column) > ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in > get_logical_type(arrow_type) > 77 elif isinstance(arrow_type, pa.lib.Decimal128Type): > 78 return 'decimal' > ---> 79 raise NotImplementedError(str(arrow_type)) > 80 > 81 > NotImplementedError: struct > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema
[ https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838389#comment-16838389 ]

Joris Van den Bossche commented on ARROW-5286:
----------------------------------------------

Actually, converting from dicts (without the need to specify the schema) also shows the same limitation: it works in {{pa.array(..)}} but not in {{pa.Table.from_pandas(..)}}:

{code:java}
In [14]: df = pd.DataFrame({'dicts': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]})

In [15]: pa.array(df['dicts'])
Out[15]:
-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    3
  ]
-- child 1 type: int64
  [
    2,
    4
  ]

In [16]: pa.Table.from_pandas(df)
...
NotImplementedError: struct
{code}

> [Python] support Structs in Table.from_pandas given a known schema
> ------------------------------------------------------------------
>
>                 Key: ARROW-5286
>                 URL: https://issues.apache.org/jira/browse/ARROW-5286
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Joris Van den Bossche
>            Assignee: Joris Van den Bossche
>            Priority: Major
>             Fix For: 0.14.0
>
> ARROW-2073 implemented creating a StructArray from an array of tuples (in addition to from dicts).
> This works in {{pyarrow.array}} (specifying the proper type):
> {code}
> In [2]: df = pd.DataFrame({'tuples': [(1, 2), (3, 4)]})
>
> In [3]: struct_type = pa.struct([('a', pa.int64()), ('b', pa.int64())])
>
> In [4]: pa.array(df['tuples'], type=struct_type)
> Out[4]:
> -- is_valid: all not null
> -- child 0 type: int64
>   [
>     1,
>     3
>   ]
> -- child 1 type: int64
>   [
>     2,
>     4
>   ]
> {code}
> But it does not yet work when converting a DataFrame to a Table while specifying the type in a schema:
> {code}
> In [5]: pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)]))
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in get_logical_type(arrow_type)
>      68     try:
> ---> 69         return logical_type_map[arrow_type.id]
>      70     except KeyError:
>
> KeyError: 24
>
> During handling of the above exception, another exception occurred:
>
> NotImplementedError                       Traceback (most recent call last)
> in
> ----> 1 pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)]))
>
> ~/scipy/repos/arrow/python/pyarrow/table.pxi in pyarrow.lib.Table.from_pandas()
>
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
>     483     metadata = construct_metadata(df, column_names, index_columns,
>     484                                   index_descriptors, preserve_index,
> --> 485                                   types)
>     486     return all_names, arrays, metadata
>     487
>
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, column_names, index_levels, index_descriptors, preserve_index, types)
>     207         metadata = get_column_metadata(df[col_name], name=sanitized_name,
>     208                                        arrow_type=arrow_type,
> --> 209                                        field_name=sanitized_name)
>     210         column_metadata.append(metadata)
>     211
>
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in get_column_metadata(column, name, arrow_type, field_name)
>     149     dict
>     150     """
> --> 151     logical_type = get_logical_type(arrow_type)
>     152
>     153     string_dtype, extra_metadata = get_extension_dtype_info(column)
>
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in get_logical_type(arrow_type)
>      77     elif isinstance(arrow_type, pa.lib.Decimal128Type):
>      78         return 'decimal'
> ---> 79     raise NotImplementedError(str(arrow_type))
>      80
>      81
>
> NotImplementedError: struct
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Assigned] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema
[ https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche reassigned ARROW-5286:
--------------------------------------------

    Assignee: Joris Van den Bossche
[jira] [Issue Comment Deleted] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan updated ARROW-5290:
----------------------------
    Comment: was deleted

(was: The assembly code of the unsafe API)

> Provide a flag to enable/disable null-checking in vectors' get methods
> ----------------------------------------------------------------------
>
>                 Key: ARROW-5290
>                 URL: https://issues.apache.org/jira/browse/ARROW-5290
>             Project: Apache Arrow
>          Issue Type: New Feature
>          Components: Java
>            Reporter: Liya Fan
>            Assignee: Liya Fan
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: safe.png
>
>          Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> For vector classes, the get method first checks whether the value at the given index is null. If it is not null, the method goes ahead and retrieves the value.
> In some scenarios, the first check is redundant because the application code has already checked for null before calling the get method. This redundant check can incur non-trivial performance overhead.
> So we add a flag to enable/disable the null checking, so users can set the flag according to their own specific scenario.
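The Arrow change itself is in Java; as an illustration only (not the Arrow implementation, and all names here are hypothetical), the pattern the description proposes can be sketched like this: a vector whose get method performs the null check only when a global flag enables it, leaving the caller responsible for checking nulls when the flag is off.

```python
# Illustrative sketch of the flag-controlled null check described in the
# issue. This is NOT Arrow's Java code; IntVector and the flag name are
# made up for the example.
NULL_CHECKING_ENABLED = True  # hypothetical global toggle

class IntVector:
    def __init__(self, values, validity):
        self._values = values      # raw slot values
        self._validity = validity  # True where the slot holds a real value

    def is_null(self, index):
        return not self._validity[index]

    def get(self, index):
        # With the flag on, behave like the existing API: reject null slots.
        if NULL_CHECKING_ENABLED and self.is_null(index):
            raise ValueError("value at index %d is null" % index)
        # With the flag off, this branch disappears; callers that already
        # called is_null() avoid paying for the check twice.
        return self._values[index]

v = IntVector([1, 0, 3], [True, False, True])
print(v.get(0))  # -> 1
```

In the real Java setting the point is that a constant flag lets the JIT eliminate the branch entirely, which is why the issue attaches assembly listings for the two variants.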
[jira] [Issue Comment Deleted] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan updated ARROW-5290:
----------------------------
    Comment: was deleted

(was: The assembly code of the safe API (when the null-checking is disabled) !safe.png!)
[jira] [Comment Edited] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector
[ https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838299#comment-16838299 ]

Ji Liu edited comment on ARROW-5224 at 5/13/19 8:08 AM:
--------------------------------------------------------

[~emkornfi...@gmail.com] Thanks for your reply. For #2 you are right.
For #1: for example, if we encode an Int or BigInt type with [https://en.wikipedia.org/wiki/LEB128], we need to read each value and reassemble the bytes, and likewise for the deserialization process. Can this be achieved with the existing implementation? Besides, is compression supported?

> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -------------------------------------------------------------------------
>
>                 Key: ARROW-5224
>                 URL: https://issues.apache.org/jira/browse/ARROW-5224
>             Project: Apache Arrow
>          Issue Type: Improvement
>            Reporter: Ji Liu
>            Assignee: Ji Liu
>            Priority: Minor
>              Labels: pull-request-available
>          Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only way to implement this is to put a single FieldVector in a VectorSchemaRoot and convert it to an ArrowRecordBatch, and the deserialization process works the same way.
> Providing a utility class to implement this may be better. All serializations should follow the IPC format so that data can be shared between different Arrow implementations, but for users who only use the Java API and want to do some further optimization, this seems to be no problem, and we could give them one more option.
> This may bring some benefits for Java users who only use ValueVector rather than the IPC classes such as ArrowRecordBatch:
> * We could do some shuffle optimizations such as compression, and some encoding algorithms for numerical types, which could greatly improve performance.
> * We could serialize/deserialize with the actual buffer size within the vector, since the allocated buffer size is a power of 2, which is bigger than what is really needed.
> * We could reduce data conversions (VectorSchemaRoot, ArrowRecordBatch, etc.) to make it more user-friendly.
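For reference, the LEB128 variable-length encoding the comment points to can be sketched in a few lines; this is the generic unsigned variant from the linked Wikipedia description, not an Arrow API.

```python
# Unsigned LEB128 sketch: each output byte carries 7 payload bits,
# with the high bit set when more bytes follow.
def leb128_encode(value: int) -> bytes:
    """Encode a non-negative int as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits of the remaining value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # final byte, high bit clear
            return bytes(out)

def leb128_decode(data: bytes) -> int:
    """Decode an unsigned LEB128 byte sequence back to an int."""
    result = 0
    for shift, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * shift)
        if not byte & 0x80:          # high bit clear marks the last byte
            break
    return result

# Classic worked example from the LEB128 article:
assert leb128_encode(624485) == bytes([0xE5, 0x8E, 0x26])
```

Small values fit in one byte while large ones grow as needed, which is why per-value reads and byte reassembly (the concern raised in the comment) are inherent to the format.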
[jira] [Commented] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838319#comment-16838319 ]

Liya Fan commented on ARROW-5290:
---------------------------------

The assembly code of the unsafe API
[jira] [Commented] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838317#comment-16838317 ]

Liya Fan commented on ARROW-5290:
---------------------------------

The assembly code of the safe API (when the null-checking is disabled)

!safe.png!
[jira] [Updated] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods
[ https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Liya Fan updated ARROW-5290:
----------------------------
    Attachment: safe.png