[jira] [Commented] (ARROW-5270) [C++] Reenable Valgrind on Travis-CI

2019-05-13 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839104#comment-16839104
 ] 

Pindikura Ravindra commented on ARROW-5270:
---

[https://travis-ci.org/apache/arrow/jobs/531878628]

> [C++] Reenable Valgrind on Travis-CI
> 
>
> Key: ARROW-5270
> URL: https://issues.apache.org/jira/browse/ARROW-5270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Running Valgrind on Travis-CI was disabled in ARROW-4611 (apparently because 
> of issues within the re2 library).
> We should reenable it at some point in order to exercise the reliability of 
> our C++ code.
> (and/or have a build with another piece of instrumentation enabled such as 
> ASAN)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5270) [C++] Reenable Valgrind on Travis-CI

2019-05-13 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5270?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839103#comment-16839103
 ] 

Pindikura Ravindra commented on ARROW-5270:
---

There are two issues:
 1. Instructions not recognized by Valgrind:

==20276== Your program just tried to execute an instruction that Valgrind 
==20276== did not recognise. There are two possible reasons for this.
==20276== 1. Your program has a bug and erroneously jumped to a non-code 
==20276== location. If you are running Memcheck and you just saw a 
==20276== warning about a bad jump, it's probably your program's fault.

 2. The re2 issues.

I think these are already covered by the suppressions listed in valgrind.supp, 
but they aren't being recognized due to missing symbols in the stack.

When I ran this on my Xenial setup without any conda setup, the stacks showed 
up correctly and got suppressed, so I suspect this is an issue with the conda 
binaries.
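For context, the entries in valgrind.supp follow Valgrind's suppression-file format. A sketch of the kind of entry involved (the entry name and frame patterns here are illustrative, not the actual contents of valgrind.supp) — note that when the binaries have no symbols, Valgrind reports frames as obj: only, so any fun: patterns in a suppression can never match:

```
{
   re2_uninitialised_cond
   Memcheck:Cond
   fun:*re2*
   obj:*
}
```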

> [C++] Reenable Valgrind on Travis-CI
> 
>
> Key: ARROW-5270
> URL: https://issues.apache.org/jira/browse/ARROW-5270
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++, Continuous Integration
>Reporter: Antoine Pitrou
>Priority: Critical
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Running Valgrind on Travis-CI was disabled in ARROW-4611 (apparently because 
> of issues within the re2 library).
> We should reenable it at some point in order to exercise the reliability of 
> our C++ code.
> (and/or have a build with another piece of instrumentation enabled such as 
> ASAN)





[jira] [Commented] (ARROW-5272) [C++] [Gandiva] JIT code executed over uninitialized values

2019-05-13 Thread Pindikura Ravindra (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839090#comment-16839090
 ] 

Pindikura Ravindra commented on ARROW-5272:
---

[~pitrou] I tried this on my Xenial setup (on GCE) with the same Valgrind 
settings, and wasn't able to reproduce this.

The Travis build also didn't show failures in the decimal test:

[https://travis-ci.org/apache/arrow/jobs/531878628]

 

Were you using some additional Valgrind flags?

> [C++] [Gandiva] JIT code executed over uninitialized values
> ---
>
> Key: ARROW-5272
> URL: https://issues.apache.org/jira/browse/ARROW-5272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Antoine Pitrou
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When running Gandiva tests with Valgrind, I get the following errors:
> {code}
> [==] Running 4 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 4 tests from TestDecimal
> [ RUN  ] TestDecimal.TestSimple
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110D5: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110E8: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x44B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x47B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> [   OK ] TestDecimal.TestSimple (16625 ms)
> [ RUN  ] TestDecimal.TestLiteral
> [   OK ] TestDecimal.TestLiteral (3480 ms)
> [ RUN  ] TestDecimal.TestIfElse
> [   OK ] TestDecimal.TestIfElse (2408 ms)
> [ RUN  ] TestDecimal.TestCompare
> [   OK ] TestDecimal.TestCompare (5303 ms)
> {code}
> I think this is legitimate. Gandiva runs computations over all values, even 
> when the bitmap indicates a null value. But decimal computations are complex 
> and involve conditional jumps, hence the error ("Conditional jump or move 
> depends on uninitialised value(s)").
> [~pravindra]





[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector

2019-05-13 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839050#comment-16839050
 ] 

Micah Kornfield commented on ARROW-5224:


[~tianchen92] my main concern with this change is that it shouldn't be a 
one-off for Java. If there is utility in these types of on-the-wire encodings, 
we should come up with a supportable way to make them work across language 
implementations. I think this is important to discuss on the mailing list 
directly (many people filter out JIRA/pull request notifications). Real 
performance numbers/benchmarks would be helpful in making the case to support 
this. I'm also curious whether you compared this against blackbox compression 
of the entire vector with something like Snappy (the link I provided above), 
to see if the encoding still provides a benefit after compression.

If we are going to make encodings supportable, we should either extend 
Schema.fbs or use the custom metadata that is already built into the schema 
(https://github.com/apache/arrow/blob/master/format/Schema.fbs#L265) so that 
encodings can be communicated across clients. Again, since the 
convention/design needs to be agreed upon, discussing it on the mailing list 
is important.

I think a utility class to convert between BigIntVector and an encoded 
VarBinaryVector could also be a valuable contribution, but for this use case I 
think you lose a lot of the value of the encoding (you have a 4-byte overhead 
per encoded entry to keep track of the offsets).



> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -
>
> Key: ARROW-5224
> URL: https://issues.apache.org/jira/browse/ARROW-5224
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only 
> way to implement this is to put a single FieldVector in a VectorSchemaRoot 
> and convert it to an ArrowRecordBatch, and the deserialization process works 
> the same way. Providing a utility class to implement this may be better. I 
> know all serializations should follow the IPC format so that data can be 
> shared between different Arrow implementations, but for users who only use 
> the Java API and want to do some further optimization, this seems to be no 
> problem and we could provide them one more option.
> This may bring some benefits for Java users who only use ValueVector rather 
> than the IPC classes such as ArrowRecordBatch:
>  * We could do some shuffle optimizations such as compression and encoding 
> algorithms for numerical types, which could greatly improve performance.
>  * Serialize/deserialize with the actual buffer size within the vector, 
> since the buffer size is a power of 2 and is actually bigger than really 
> needed.
>  * Reduce data conversions (VectorSchemaRoot, ArrowRecordBatch, etc.) to 
> make it user-friendly.
>  





[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector

2019-05-13 Thread Ji Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839020#comment-16839020
 ] 

Ji Liu commented on ARROW-5224:
---

[~emkornfi...@gmail.com] [~bryanc] Thanks for your comments. Sure, we have 
tested the performance of encoding Arrow data in our application, and it shows 
that this significantly reduces shuffle data with equal or even lower E2E time 
(for the Int and BigInt types).

I agree with [~bryanc]: we could simply provide a utility class to encode a 
BigIntVector into a VarBinaryVector (the only thing I'm worried about is 
whether the multiple transformations will result in significant performance 
overhead). This way, we won't break the existing APIs and protocol. I would 
like to proceed this way and test the performance as well. If this works well, 
we can further extend it to other languages.

What do you think?

> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -
>
> Key: ARROW-5224
> URL: https://issues.apache.org/jira/browse/ARROW-5224
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only 
> way to implement this is to put a single FieldVector in a VectorSchemaRoot 
> and convert it to an ArrowRecordBatch, and the deserialization process works 
> the same way. Providing a utility class to implement this may be better. I 
> know all serializations should follow the IPC format so that data can be 
> shared between different Arrow implementations, but for users who only use 
> the Java API and want to do some further optimization, this seems to be no 
> problem and we could provide them one more option.
> This may bring some benefits for Java users who only use ValueVector rather 
> than the IPC classes such as ArrowRecordBatch:
>  * We could do some shuffle optimizations such as compression and encoding 
> algorithms for numerical types, which could greatly improve performance.
>  * Serialize/deserialize with the actual buffer size within the vector, 
> since the buffer size is a power of 2 and is actually bigger than really 
> needed.
>  * Reduce data conversions (VectorSchemaRoot, ArrowRecordBatch, etc.) to 
> make it user-friendly.
>  





[jira] [Commented] (ARROW-5102) [C++] Reduce header dependencies

2019-05-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838962#comment-16838962
 ] 

Wes McKinney commented on ARROW-5102:
-

I would be in favor of adding a {{StatusBuilder}} API

> [C++] Reduce header dependencies
> 
>
> Key: ARROW-5102
> URL: https://issues.apache.org/jira/browse/ARROW-5102
> Project: Apache Arrow
>  Issue Type: Wish
>  Components: C++
>Affects Versions: 0.13.0
>Reporter: Antoine Pitrou
>Priority: Major
> Fix For: 0.14.0
>
>
> To tame C++ compile times, we should try to reduce the number of heavy 
> dependencies in our .h files.
> Two possible avenues come to mind:
> * avoid including `unordered_map` and friends
> * avoid including C++ stream libraries (such as `iostream`, `ios`, 
> `sstream`...)
> Unfortunately we're currently including `sstream` in `status.h` for some 
> template APIs. We may move those to a separate include file (e.g. 
> `status-builder.h`).





[jira] [Updated] (ARROW-5314) [Go] Incorrect Printing for String Arrays with Offsets

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5314:
--
Labels: pull-request-available  (was: )

> [Go] Incorrect Printing for String Arrays with Offsets 
> ---
>
> Key: ARROW-5314
> URL: https://issues.apache.org/jira/browse/ARROW-5314
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: James Walker
>Priority: Trivial
>  Labels: pull-request-available
>
> If an additional string field is added to the Table example 
> ([https://github.com/apache/arrow/blob/master/go/arrow/example_test.go#L495-L546]), 
> the TableReader outputs unexpected results.
> Modified Table example:
> {code:java}
> pool := memory.NewGoAllocator()
> schema := arrow.NewSchema(
> []arrow.Field{
> arrow.Field{Name: "f1-i32", Type: arrow.PrimitiveTypes.Int32},
> arrow.Field{Name: "f2-f64", Type: arrow.PrimitiveTypes.Float64},
> arrow.Field{Name: "string", Type: arrow.BinaryTypes.String},
> },
> nil,
> )
> b := array.NewRecordBuilder(pool, schema)
> defer b.Release()
> b.Field(0).(*array.Int32Builder).AppendValues([]int32{1, 2, 3, 4, 5, 6}, nil)
> b.Field(0).(*array.Int32Builder).AppendValues([]int32{7, 8, 9, 10}, 
> []bool{true, true, false, true})
> b.Field(1).(*array.Float64Builder).AppendValues([]float64{1, 2, 3, 4, 5, 6, 
> 7, 8, 9, 10}, nil)
> b.Field(2).(*array.StringBuilder).AppendValues([]string{
> "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
> }, nil)
> rec1 := b.NewRecord()
> defer rec1.Release()
> b.Field(0).(*array.Int32Builder).AppendValues([]int32{11, 12, 13, 14, 15, 16, 
> 17, 18, 19, 20}, nil)
> b.Field(1).(*array.Float64Builder).AppendValues([]float64{11, 12, 13, 14, 15, 
> 16, 17, 18, 19, 20}, nil)
> b.Field(2).(*array.StringBuilder).AppendValues([]string{
> "eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", 
> "seventeen", "eighteen", "nineteen", "twenty",
> }, nil)
> rec2 := b.NewRecord()
> defer rec2.Release()
> tbl := array.NewTableFromRecords(schema, []array.Record{rec1, rec2})
> defer tbl.Release()
> tr := array.NewTableReader(tbl, 2)
> defer tr.Release()
> n := 0
> for tr.Next() {
> rec := tr.Record()
> for i, col := range rec.Columns() {
> fmt.Printf("rec[%d][%q]: %v\n", n, rec.ColumnName(i), col)
> }
> n++
> }
> {code}
>  
> output:
> {code:java}
> rec[0]["f1-i32"]: [1 2]
> rec[0]["f2-f64"]: [1 2]
> rec[0]["string"]: ["one" "two"]
> rec[1]["f1-i32"]: [3 4]
> rec[1]["f2-f64"]: [3 4]
> rec[1]["string"]: ["one" "two"]
> rec[2]["f1-i32"]: [5 6]
> rec[2]["f2-f64"]: [5 6]
> rec[2]["string"]: ["one" "two"]
> rec[3]["f1-i32"]: [7 8]
> rec[3]["f2-f64"]: [7 8]
> rec[3]["string"]: ["one" "two"]
> rec[4]["f1-i32"]: [(null) 10]
> rec[4]["f2-f64"]: [9 10]
> rec[4]["string"]: ["one" "two"]
> rec[5]["f1-i32"]: [11 12]
> rec[5]["f2-f64"]: [11 12]
> rec[5]["string"]: ["eleven" "twelve"]
> rec[6]["f1-i32"]: [13 14]
> rec[6]["f2-f64"]: [13 14]
> rec[6]["string"]: ["eleven" "twelve"]
> rec[7]["f1-i32"]: [15 16]
> rec[7]["f2-f64"]: [15 16]
> rec[7]["string"]: ["eleven" "twelve"]
> rec[8]["f1-i32"]: [17 18]
> rec[8]["f2-f64"]: [17 18]
> rec[8]["string"]: ["eleven" "twelve"]
> rec[9]["f1-i32"]: [19 20]
> rec[9]["f2-f64"]: [19 20]
> rec[9]["string"]: ["eleven" "twelve"]
>  
> {code}
>  





[jira] [Resolved] (ARROW-5268) [GLib] Add GArrowJSONReader

2019-05-13 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-5268.
-
Resolution: Fixed

Issue resolved by pull request 4263
[https://github.com/apache/arrow/pull/4263]

> [GLib] Add GArrowJSONReader
> ---
>
> Key: ARROW-5268
> URL: https://issues.apache.org/jira/browse/ARROW-5268
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 4h
>  Remaining Estimate: 0h
>






[jira] [Created] (ARROW-5314) [Go] Incorrect Printing for String Arrays with Offsets

2019-05-13 Thread James Walker (JIRA)
James Walker created ARROW-5314:
---

 Summary: [Go] Incorrect Printing for String Arrays with Offsets 
 Key: ARROW-5314
 URL: https://issues.apache.org/jira/browse/ARROW-5314
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: James Walker


If an additional string field is added to the Table example 
([https://github.com/apache/arrow/blob/master/go/arrow/example_test.go#L495-L546]), 
the TableReader outputs unexpected results.

Modified Table example:
{code:java}
pool := memory.NewGoAllocator()

schema := arrow.NewSchema(
[]arrow.Field{
arrow.Field{Name: "f1-i32", Type: arrow.PrimitiveTypes.Int32},
arrow.Field{Name: "f2-f64", Type: arrow.PrimitiveTypes.Float64},
arrow.Field{Name: "string", Type: arrow.BinaryTypes.String},
},
nil,
)

b := array.NewRecordBuilder(pool, schema)
defer b.Release()

b.Field(0).(*array.Int32Builder).AppendValues([]int32{1, 2, 3, 4, 5, 6}, nil)
b.Field(0).(*array.Int32Builder).AppendValues([]int32{7, 8, 9, 10}, 
[]bool{true, true, false, true})
b.Field(1).(*array.Float64Builder).AppendValues([]float64{1, 2, 3, 4, 5, 6, 7, 
8, 9, 10}, nil)
b.Field(2).(*array.StringBuilder).AppendValues([]string{
"one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten",
}, nil)

rec1 := b.NewRecord()
defer rec1.Release()

b.Field(0).(*array.Int32Builder).AppendValues([]int32{11, 12, 13, 14, 15, 16, 
17, 18, 19, 20}, nil)
b.Field(1).(*array.Float64Builder).AppendValues([]float64{11, 12, 13, 14, 15, 
16, 17, 18, 19, 20}, nil)
b.Field(2).(*array.StringBuilder).AppendValues([]string{
"eleven", "twelve", "thirteen", "fourteen", "fifteen", "sixteen", "seventeen", 
"eighteen", "nineteen", "twenty",
}, nil)

rec2 := b.NewRecord()
defer rec2.Release()

tbl := array.NewTableFromRecords(schema, []array.Record{rec1, rec2})
defer tbl.Release()

tr := array.NewTableReader(tbl, 2)
defer tr.Release()

n := 0
for tr.Next() {
rec := tr.Record()
for i, col := range rec.Columns() {
fmt.Printf("rec[%d][%q]: %v\n", n, rec.ColumnName(i), col)
}
n++
}
{code}
 

output:
{code:java}
rec[0]["f1-i32"]: [1 2]
rec[0]["f2-f64"]: [1 2]
rec[0]["string"]: ["one" "two"]
rec[1]["f1-i32"]: [3 4]
rec[1]["f2-f64"]: [3 4]
rec[1]["string"]: ["one" "two"]
rec[2]["f1-i32"]: [5 6]
rec[2]["f2-f64"]: [5 6]
rec[2]["string"]: ["one" "two"]
rec[3]["f1-i32"]: [7 8]
rec[3]["f2-f64"]: [7 8]
rec[3]["string"]: ["one" "two"]
rec[4]["f1-i32"]: [(null) 10]
rec[4]["f2-f64"]: [9 10]
rec[4]["string"]: ["one" "two"]
rec[5]["f1-i32"]: [11 12]
rec[5]["f2-f64"]: [11 12]
rec[5]["string"]: ["eleven" "twelve"]
rec[6]["f1-i32"]: [13 14]
rec[6]["f2-f64"]: [13 14]
rec[6]["string"]: ["eleven" "twelve"]
rec[7]["f1-i32"]: [15 16]
rec[7]["f2-f64"]: [15 16]
rec[7]["string"]: ["eleven" "twelve"]
rec[8]["f1-i32"]: [17 18]
rec[8]["f2-f64"]: [17 18]
rec[8]["string"]: ["eleven" "twelve"]
rec[9]["f1-i32"]: [19 20]
rec[9]["f2-f64"]: [19 20]
rec[9]["string"]: ["eleven" "twelve"]
 
{code}
 





[jira] [Updated] (ARROW-5313) [Format] Comments on Field table are a bit confusing

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5313:
--
Labels: pull-request-available  (was: )

> [Format] Comments on Field table are a bit confusing
> 
>
> Key: ARROW-5313
> URL: https://issues.apache.org/jira/browse/ARROW-5313
> Project: Apache Arrow
>  Issue Type: Task
>  Components: Format
>Affects Versions: 0.13.0
>Reporter: Brian Hulette
>Assignee: Brian Hulette
>Priority: Major
>  Labels: pull-request-available
>
> Currently Schema.fbs has two different explanations of {{Field.children}}.
> One says "children is only for nested Arrow arrays" and the other says 
> "children apply only to nested data types like Struct, List and Union". I 
> think both are technically correct, but the latter is much more explicit; we 
> should remove the former.





[jira] [Created] (ARROW-5313) [Format] Comments on Field table are a bit confusing

2019-05-13 Thread Brian Hulette (JIRA)
Brian Hulette created ARROW-5313:


 Summary: [Format] Comments on Field table are a bit confusing
 Key: ARROW-5313
 URL: https://issues.apache.org/jira/browse/ARROW-5313
 Project: Apache Arrow
  Issue Type: Task
  Components: Format
Affects Versions: 0.13.0
Reporter: Brian Hulette
Assignee: Brian Hulette


Currently Schema.fbs has two different explanations of {{Field.children}}.

One says "children is only for nested Arrow arrays" and the other says 
"children apply only to nested data types like Struct, List and Union". I 
think both are technically correct, but the latter is much more explicit; we 
should remove the former.





[jira] [Created] (ARROW-5312) [C++] Move JSON integration testing utilities to arrow/testing and libarrow_testing.so

2019-05-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5312:
---

 Summary: [C++] Move JSON integration testing utilities to 
arrow/testing and libarrow_testing.so
 Key: ARROW-5312
 URL: https://issues.apache.org/jira/browse/ARROW-5312
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.14.0


It's not necessary to have this code in libarrow.so. Let's tackle this after 
ARROW-3144 and ARROW-835.





[jira] [Resolved] (ARROW-5306) [CI] [GLib] Disable GTK-Doc

2019-05-13 Thread Kouhei Sutou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kouhei Sutou resolved ARROW-5306.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4299
[https://github.com/apache/arrow/pull/4299]

> [CI] [GLib] Disable GTK-Doc
> ---
>
> Key: ARROW-5306
> URL: https://issues.apache.org/jira/browse/ARROW-5306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Travis fails to process the documentation with GTK-Doc:
> [https://travis-ci.org/apache/arrow/jobs/531197944#L4170]
> This is caused by the recent GTK-Doc upgrade to 0.13.0, so disable GTK-Doc 
> until 0.13.1 is released.





[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector

2019-05-13 Thread Bryan Cutler (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838761#comment-16838761
 ] 

Bryan Cutler commented on ARROW-5224:
-

[~tianchen92] could you encode the BigIntVector into a VarBinaryVector as 
LEB128 and then serialize that vector as an Arrow RecordBatch?
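LEB128 is the variable-length integer encoding used by DWARF and protobuf-style varints: small magnitudes take one or two bytes instead of a fixed eight. As a self-contained sketch of the unsigned variant (the class name and API below are illustrative, not part of Arrow):

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch (not Arrow API): unsigned LEB128, the variable-length
// encoding suggested for packing BigInt values into a VarBinaryVector.
public class Leb128 {
    // Encode a long as unsigned LEB128: 7 data bits per byte, with the
    // high (continuation) bit set on every byte except the last.
    static byte[] encode(long value) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        do {
            int b = (int) (value & 0x7F);
            value >>>= 7;
            if (value != 0) b |= 0x80;   // more bytes follow
            out.write(b);
        } while (value != 0);
        return out.toByteArray();
    }

    // Decode a single unsigned LEB128 value from the start of the buffer.
    static long decode(byte[] bytes) {
        long result = 0;
        int shift = 0;
        for (byte b : bytes) {
            result |= ((long) (b & 0x7F)) << shift;
            if ((b & 0x80) == 0) break;  // last byte: continuation bit clear
            shift += 7;
        }
        return result;
    }

    public static void main(String[] args) {
        // 300 encodes to two bytes (0xAC 0x02) instead of eight.
        System.out.println(encode(300L).length);
    }
}
```

Each encoded entry would then be appended to the VarBinaryVector, and that vector serialized as a normal Arrow RecordBatch.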

> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -
>
> Key: ARROW-5224
> URL: https://issues.apache.org/jira/browse/ARROW-5224
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only 
> way to implement this is to put a single FieldVector in a VectorSchemaRoot 
> and convert it to an ArrowRecordBatch, and the deserialization process works 
> the same way. Providing a utility class to implement this may be better. I 
> know all serializations should follow the IPC format so that data can be 
> shared between different Arrow implementations, but for users who only use 
> the Java API and want to do some further optimization, this seems to be no 
> problem and we could provide them one more option.
> This may bring some benefits for Java users who only use ValueVector rather 
> than the IPC classes such as ArrowRecordBatch:
>  * We could do some shuffle optimizations such as compression and encoding 
> algorithms for numerical types, which could greatly improve performance.
>  * Serialize/deserialize with the actual buffer size within the vector, 
> since the buffer size is a power of 2 and is actually bigger than really 
> needed.
>  * Reduce data conversions (VectorSchemaRoot, ArrowRecordBatch, etc.) to 
> make it user-friendly.
>  





[jira] [Resolved] (ARROW-5291) [Python] Add wrapper for "take" kernel on Array

2019-05-13 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5291?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5291.
---
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4281
[https://github.com/apache/arrow/pull/4281]

> [Python] Add wrapper for "take" kernel on Array 
> 
>
> Key: ARROW-5291
> URL: https://issues.apache.org/jira/browse/ARROW-5291
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 1h 50m
>  Remaining Estimate: 0h
>
> Expose the {{take}} kernel (for primitive types, ARROW-2102) on the python 
> {{Array}} class. Part of ARROW-2667.





[jira] [Updated] (ARROW-4993) [C++] Display summary at the end of CMake configuration

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-4993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-4993:
--
Labels: pull-request-available  (was: )

> [C++] Display summary at the end of CMake configuration
> ---
>
> Key: ARROW-4993
> URL: https://issues.apache.org/jira/browse/ARROW-4993
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Affects Versions: 0.12.1
>Reporter: Antoine Pitrou
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> Some third-party projects like Thrift display a nice and useful summary of 
> the build configuration at the end of the CMake configuration run:
> https://ci.appveyor.com/project/pitrou/arrow/build/job/mgi68rvk0u5jf2s4?fullLog=true#L2325
> It may be good to have a similar thing in Arrow as well. Bonus points if, for 
> each configuration item, it says which CMake variable can be used to 
> influence it.
> Something like:
> {code}
> -- Build ZSTD support: ON  [change using ARROW_WITH_ZSTD]
> -- Build BZ2 support:  OFF [change using ARROW_WITH_BZ2]
> {code}





[jira] [Updated] (ARROW-1012) [C++] Create a configurable implementation of RecordBatchReader that reads from Apache Parquet files

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-1012:
--
Labels: parquet pull-request-available  (was: parquet)

> [C++] Create a configurable implementation of RecordBatchReader that reads 
> from Apache Parquet files
> 
>
> Key: ARROW-1012
> URL: https://issues.apache.org/jira/browse/ARROW-1012
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Hatem Helal
>Priority: Major
>  Labels: parquet, pull-request-available
> Fix For: 0.14.0
>
>
> This will be enabled by -ARROW-1008.-
> A preliminary implementation of an {{arrow::RecordBatchReader}} was added in 
> PARQUET-1166 but does not support configuring the batch size.  





[jira] [Updated] (ARROW-5272) [C++] [Gandiva] JIT code executed over uninitialized values

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5272:
--
Labels: pull-request-available  (was: )

> [C++] [Gandiva] JIT code executed over uninitialized values
> ---
>
> Key: ARROW-5272
> URL: https://issues.apache.org/jira/browse/ARROW-5272
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++ - Gandiva
>Reporter: Antoine Pitrou
>Assignee: Pindikura Ravindra
>Priority: Major
>  Labels: pull-request-available
>
> When running Gandiva tests with Valgrind, I get the following errors:
> {code}
> [==] Running 4 tests from 1 test case.
> [--] Global test environment set-up.
> [--] 4 tests from TestDecimal
> [ RUN  ] TestDecimal.TestSimple
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110D5: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x41110E8: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x44B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> ==12052== Conditional jump or move depends on uninitialised value(s)
> ==12052==at 0x47B: ???
> ==12052== 
> {
>
>Memcheck:Cond
>obj:*
> }
> [   OK ] TestDecimal.TestSimple (16625 ms)
> [ RUN  ] TestDecimal.TestLiteral
> [   OK ] TestDecimal.TestLiteral (3480 ms)
> [ RUN  ] TestDecimal.TestIfElse
> [   OK ] TestDecimal.TestIfElse (2408 ms)
> [ RUN  ] TestDecimal.TestCompare
> [   OK ] TestDecimal.TestCompare (5303 ms)
> {code}
> I think this is legitimate. Gandiva runs computations over all values, even 
> when the bitmap indicates a null value. But decimal computations are complex 
> and involve conditional jumps, hence the error ("Conditional jump or move 
> depends on uninitialised value(s)").
> [~pravindra]





[jira] [Commented] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector

2019-05-13 Thread Micah Kornfield (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838665#comment-16838665
 ] 

Micah Kornfield commented on ARROW-5224:


For #1, this seems fairly application-specific, so I think it would be best to 
either agree that there is interest in supporting this across languages, or to 
have it in a separate library. But others on the mailing list might have 
different opinions. Also, do you have benchmarks showing that the encoding 
improves performance in your system? At least in some cases throughput 
declines and latency goes up due to the extra serialization and 
deserialization cost on each side of the wire. Lastly, for compression you 
should be able to get decent compression by using a WritableByteChannel that 
compresses things on the way out (e.g. 
https://github.com/xerial/snappy-java/blob/master/src/main/java/org/xerial/snappy/SnappyFramedOutputStream.java)
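The "blackbox" approach compresses the serialized buffer as an opaque whole, with no per-type encoding. As a self-contained sketch using only the JDK, with GZIPOutputStream standing in for snappy-java's SnappyFramedOutputStream (class and method names here are illustrative):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative "blackbox" compression of an already-serialized buffer,
// using the JDK's GZIP streams as a stand-in for Snappy's framed streams
// (which wrap an OutputStream in the same way).
public class BlackboxCompression {
    static byte[] compress(byte[] serialized) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
                gz.write(serialized);  // compress the whole buffer on the way out
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    static byte[] decompress(byte[] compressed) {
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) > 0) {
                bos.write(buf, 0, n);  // drain decompressed bytes
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

Comparing the compressed size of the raw buffer against the compressed size of the encoded buffer answers whether the encoding still pays off after compression.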

> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -
>
> Key: ARROW-5224
> URL: https://issues.apache.org/jira/browse/ARROW-5224
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only way 
> to implement this is to put a single FieldVector in a VectorSchemaRoot and 
> convert it to an ArrowRecordBatch, and the deserialization process works the 
> same way. Providing a utility class to implement this may be better. I know 
> all serializations should follow the IPC format so that data can be shared 
> between different Arrow implementations, but for users who only use the Java 
> API and want to do some further optimization, this seems to be no problem and 
> we could provide them one more option.
> This may bring some benefits for Java users who only use ValueVector rather 
> than IPC classes such as ArrowRecordBatch:
>  * We could do some shuffle optimizations such as compression and encoding 
> algorithms for numerical types, which could greatly improve performance.
>  * Do serialization/deserialization with the actual buffer size within the 
> vector, since the buffer capacity is a power of 2 and thus bigger than is 
> really needed.
>  * Reduce data conversions (VectorSchemaRoot, ArrowRecordBatch, etc.) to make 
> it user-friendly.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2981) [C++] Support scripts / documentation for running clang-tidy on codebase

2019-05-13 Thread Uwe L. Korn (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838624#comment-16838624
 ] 

Uwe L. Korn commented on ARROW-2981:


[~bkietz] This is the intended behaviour. We also have a check-format command 
in CMake but not yet exposed via docker-compose.

> [C++] Support scripts / documentation for running clang-tidy on codebase
> 
>
> Key: ARROW-2981
> URL: https://issues.apache.org/jira/browse/ARROW-2981
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Related to ARROW-2952, ARROW-2980



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2981) [C++] Support scripts / documentation for running clang-tidy on codebase

2019-05-13 Thread Benjamin Kietzman (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2981?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838613#comment-16838613
 ] 

Benjamin Kietzman commented on ARROW-2981:
--

[~wesmckinn] [~fsaintjacques] Currently, `docker-compose run format` modifies 
source in place. Is this the intended behavior for that service, and is that 
the behavior we want for clang-tidy? Alternatively, do we just want to emit 
warnings/errors and leave the source unmodified?

> [C++] Support scripts / documentation for running clang-tidy on codebase
> 
>
> Key: ARROW-2981
> URL: https://issues.apache.org/jira/browse/ARROW-2981
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 40m
>  Remaining Estimate: 0h
>
> Related to ARROW-2952, ARROW-2980



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5311) [C++] Return more specific invalid Status in Take kernel

2019-05-13 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5311:


 Summary: [C++] Return more specific invalid Status in Take kernel
 Key: ARROW-5311
 URL: https://issues.apache.org/jira/browse/ARROW-5311
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Joris Van den Bossche
 Fix For: 0.14.0


Currently the {{Take}} kernel returns a generic Invalid Status for certain 
cases that could use a more specific error:

- indices of the wrong type (e.g. floats) -> TypeError instead of Invalid?
- out of bounds index -> a new IndexError?

From review in https://github.com/apache/arrow/pull/4281

cc [~bkietz]
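The proposed mapping could be sketched as follows (a purely illustrative Python-side guard showing which error class goes with which failure mode — not the actual kernel code, which lives in C++):

```python
def check_take_indices(indices, length):
    # Illustrative guard mirroring the proposal above: a wrong index
    # type raises TypeError, an out-of-bounds index raises IndexError,
    # instead of a single generic "Invalid" status for both.
    if not all(isinstance(i, int) for i in indices):
        raise TypeError("take indices must be of integer type")
    if any(i < 0 or i >= length for i in indices):
        raise IndexError("take index out of bounds")
```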



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (ARROW-1280) [C++] Implement Fixed Size List type

2019-05-13 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou reassigned ARROW-1280:
-

Assignee: Benjamin Kietzman

> [C++] Implement Fixed Size List type
> 
>
> Key: ARROW-1280
> URL: https://issues.apache.org/jira/browse/ARROW-1280
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Assignee: Benjamin Kietzman
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 5h 40m
>  Remaining Estimate: 0h
>
> At the moment, we only support lists with a variable size per entry. In some 
> cases, each entry of a list column will have the same number of elements. In 
> this case, we can use a more effective data structure as well as do certain 
> optimisations on the operations of this type. To implement this type:
> * Describe the memory structure of it in Layout.md
> * Add the type to the enums in the C++ code
> * Add FixedSizeListArray, FixedSizeListType and FixedSizeListBuilder classes 
> to the C++ library



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-1280) [C++] Implement Fixed Size List type

2019-05-13 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-1280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-1280.
---
Resolution: Fixed

Issue resolved by pull request 4278
[https://github.com/apache/arrow/pull/4278]

> [C++] Implement Fixed Size List type
> 
>
> Key: ARROW-1280
> URL: https://issues.apache.org/jira/browse/ARROW-1280
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: C++
>Reporter: Wes McKinney
>Priority: Major
>  Labels: beginner, pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 5.5h
>  Remaining Estimate: 0h
>
> At the moment, we only support lists with a variable size per entry. In some 
> cases, each entry of a list column will have the same number of elements. In 
> this case, we can use a more effective data structure as well as do certain 
> optimisations on the operations of this type. To implement this type:
> * Describe the memory structure of it in Layout.md
> * Add the type to the enums in the C++ code
> * Add FixedSizeListArray, FixedSizeListType and FixedSizeListBuilder classes 
> to the C++ library



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-4516) [Python] Error while creating a ParquetDataset on a path without `_common_dataset` but with an empty `_tempfile`

2019-05-13 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-4516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838604#comment-16838604
 ] 

Joris Van den Bossche commented on ARROW-4516:
--

Similarly to ARROW-1079 / https://github.com/apache/arrow/pull/860 (which 
filtered out _directories_ that started with an underscore), we might also want 
to exclude all "private" files, except for the commonly recognised ones such as 
{{_metadata}} and {{_common_metadata}}.
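The filtering rule could be sketched as follows (function and constant names are illustrative, not the actual pyarrow internals):

```python
import os

# Metadata files that should survive the filter, per the comment above.
ALLOWED_PRIVATE = ('_metadata', '_common_metadata')

def is_private_file(path):
    # Skip files whose basename starts with '_' unless they are one of
    # the commonly recognised metadata files.
    name = os.path.basename(path)
    return name.startswith('_') and name not in ALLOWED_PRIVATE

files = ['/tmp/pq/part1.parquet', '/tmp/pq/_tempfile', '/tmp/pq/_metadata']
kept = [f for f in files if not is_private_file(f)]
```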


 

> [Python] Error while creating a ParquetDataset on a path without 
> `_common_dataset` but with an empty `_tempfile`
> 
>
> Key: ARROW-4516
> URL: https://issues.apache.org/jira/browse/ARROW-4516
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Python
>Affects Versions: 0.12.0
>Reporter: yogesh garg
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> I suspect that there's an error in this line of code:
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L926
> While validating schema in the initialisation of a {{ParquetDataset}}, we 
> assume that if {{_common_metadata}} file does not exist, the schema should be 
> inferred from the first piece of that dataset. The first piece, in my 
> experience, could refer to a file named with an underscore that does not 
> necessarily contain the schema and could be an empty file, e.g. 
> {{_tempfile}}.
> {code:bash}
> /tmp/pq/
> ├── part1.parquet
> └── _tempfile
> {code}
> This behavior is allowed by the parquet specification, and we should probably 
> ignore such pieces.
> On a cursory look, we could do either of the following.
> 1. Choose the first piece with path that does not start with "_"
> 2. Sort pieces by name, but put all the "_" pieces later while making the 
> manifest. 
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L729
> 3. Silently exclude all the files starting with "_" here, but this will need 
> to be tested: 
> https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L770



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5293) [C++] Take kernel on DictionaryArray does not preserve ordered flag

2019-05-13 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-5293:
-
Fix Version/s: 0.14.0

> [C++] Take kernel on DictionaryArray does not preserve ordered flag
> ---
>
> Key: ARROW-5293
> URL: https://issues.apache.org/jira/browse/ARROW-5293
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: C++
>Reporter: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>
> In the Python tests I was adding, this was failing for an ordered 
> DictionaryArray: 
> https://github.com/apache/arrow/pull/4281/commits/1f65936e1a06ae415647af7d5c7f54c5937861f6#diff-01b63f189a63c0d4016f2f91370e08fcR92



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5310) [Python] better error message on creating ParquetDataset from empty directory

2019-05-13 Thread Joris Van den Bossche (JIRA)
Joris Van den Bossche created ARROW-5310:


 Summary: [Python] better error message on creating ParquetDataset 
from empty directory
 Key: ARROW-5310
 URL: https://issues.apache.org/jira/browse/ARROW-5310
 Project: Apache Arrow
  Issue Type: Bug
  Components: Python
Reporter: Joris Van den Bossche


Currently, this is what you get when {{path}} is an existing but empty directory:

{code:python}
>>> dataset = pq.ParquetDataset(path)
---
IndexError                                Traceback (most recent call last)
<ipython-input-...> in <module>
----> 1 dataset = pq.ParquetDataset(path)

~/scipy/repos/arrow/python/pyarrow/parquet.py in __init__(self, path_or_paths, 
filesystem, schema, metadata, split_row_groups, validate_schema, filters, 
metadata_nthreads, memory_map)
989 
990 if validate_schema:
--> 991 self.validate_schemas()
992 
993 if filters is not None:

~/scipy/repos/arrow/python/pyarrow/parquet.py in validate_schemas(self)
   1025 self.schema = self.common_metadata.schema
   1026 else:
-> 1027 self.schema = self.pieces[0].get_metadata().schema
   1028 elif self.schema is None:
   1029 self.schema = self.metadata.schema

IndexError: list index out of range
{code}

That could be a nicer error message. 

Unless we actually want to allow this? (although I am not sure there are good 
use cases of empty directories to support this, because from an empty directory 
we cannot get any schema or metadata information?) 
It is only failing when validating the schemas, so with 
{{validate_schema=False}} it actually returns a ParquetDataset object, just 
with an empty list for {{pieces}} and no schema. So it would be easy to not 
error when validating the schemas as well for this empty-directory case.
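A sketch of the guard being discussed (function and attribute names are illustrative, not the actual {{parquet.py}} internals):

```python
def infer_schema(pieces, common_metadata=None, schema=None):
    # Mirror of the validate_schemas fallback chain, with an explicit
    # check for the empty-directory case instead of an IndexError on
    # pieces[0].
    if schema is not None:
        return schema
    if common_metadata is not None:
        return common_metadata.schema
    if not pieces:
        raise ValueError("ParquetDataset contains no pieces; cannot infer "
                         "a schema from an empty directory")
    return pieces[0].get_metadata().schema
```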



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-2572) [Python] Add factory function to create a Table from Columns and Schema.

2019-05-13 Thread Antoine Pitrou (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-2572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838562#comment-16838562
 ] 

Antoine Pitrou commented on ARROW-2572:
---

[~jorisvandenbossche]

> [Python] Add factory function to create a Table from Columns and Schema.
> 
>
> Key: ARROW-2572
> URL: https://issues.apache.org/jira/browse/ARROW-2572
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Affects Versions: 0.9.0
>Reporter: Thomas Buhrmann
>Priority: Minor
>  Labels: beginner
> Fix For: 0.14.0
>
>
> At the moment it seems to be impossible in Python to add custom metadata to a 
> Table or Column. The closest I've come is to create a list of new Fields (by 
> "appending" metadata to existing Fields), and then creating a new Schema from 
> these Fields using the Schema factory function. But I can't see how to create 
> a new table from the existing Columns and my new Schema, which I understand 
> would be the way to do it in C++?
> Essentially, wrappers for the Table's Make(...) functions seem to be missing.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files

2019-05-13 Thread Wes McKinney (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838559#comment-16838559
 ] 

Wes McKinney commented on ARROW-3424:
-

Yes, that might work. I think we should hold off until we can migrate this 
logic into C++, though

> [Python] Improved workflow for loading an arbitrary collection of Parquet 
> files
> ---
>
> Key: ARROW-3424
> URL: https://issues.apache.org/jira/browse/ARROW-3424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> See SO question for use case: 
> https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema

2019-05-13 Thread Antoine Pitrou (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Antoine Pitrou resolved ARROW-5286.
---
Resolution: Fixed

Issue resolved by pull request 4297
[https://github.com/apache/arrow/pull/4297]

> [Python] support Structs in Table.from_pandas given a known schema
> --
>
> Key: ARROW-5286
> URL: https://issues.apache.org/jira/browse/ARROW-5286
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> ARROW-2073 implemented creating a StructArray from an array of tuples (in 
> addition to from dicts). 
> This works in {{pyarrow.array}} (specifying the proper type):
> {code}
> In [2]: df = pd.DataFrame({'tuples': [(1, 2), (3, 4)]})   
>   
>   
> In [3]: struct_type = pa.struct([('a', pa.int64()), ('b', pa.int64())])   
>   
>   
> In [4]: pa.array(df['tuples'], type=struct_type)  
>   
>   
> Out[4]: 
> 
> -- is_valid: all not null
> -- child 0 type: int64
>   [
> 1,
> 3
>   ]
> -- child 1 type: int64
>   [
> 2,
> 4
>   ]
> {code}
> But does not yet work when converting a DataFrame to Table while specifying 
> the type in a schema:
> {code}
> In [5]: pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)])) 
>   
>   
> ---
> KeyError  Traceback (most recent call last)
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> get_logical_type(arrow_type)
>  68 try:
> ---> 69 return logical_type_map[arrow_type.id]
>  70 except KeyError:
> KeyError: 24
> During handling of the above exception, another exception occurred:
> NotImplementedError   Traceback (most recent call last)
>  in 
> > 1 pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)]))
> ~/scipy/repos/arrow/python/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 483 metadata = construct_metadata(df, column_names, index_columns,
> 484   index_descriptors, preserve_index,
> --> 485   types)
> 486 return all_names, arrays, metadata
> 487 
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, 
> column_names, index_levels, index_descriptors, preserve_index, types)
> 207 metadata = get_column_metadata(df[col_name], 
> name=sanitized_name,
> 208arrow_type=arrow_type,
> --> 209field_name=sanitized_name)
> 210 column_metadata.append(metadata)
> 211 
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> get_column_metadata(column, name, arrow_type, field_name)
> 149 dict
> 150 """
> --> 151 logical_type = get_logical_type(arrow_type)
> 152 
> 153 string_dtype, extra_metadata = get_extension_dtype_info(column)
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> get_logical_type(arrow_type)
>  77 elif isinstance(arrow_type, pa.lib.Decimal128Type):
>  78 return 'decimal'
> ---> 79 raise NotImplementedError(str(arrow_type))
>  80 
>  81 
> NotImplementedError: struct
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5290) [Java] Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney updated ARROW-5290:

Summary: [Java] Provide a flag to enable/disable null-checking in vectors' 
get methods  (was: Provide a flag to enable/disable null-checking in vectors' 
get methods)

> [Java] Provide a flag to enable/disable null-checking in vectors' get methods
> -
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> For vector classes, the get method first checks if the value at the given 
> index is null. If it is not null, the method goes ahead to retrieve the 
> value. 
> For some scenarios, the first check is redundant, because the application 
> code has already checked the null, before calling the get method. This 
> redundant check may have non-trivial performance overheads. 
> So we add a flag to enable/disable the null checking, so the user can set the 
> flag according to their own specific scenario. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Wes McKinney (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wes McKinney resolved ARROW-5290.
-
   Resolution: Fixed
Fix Version/s: 0.14.0

Issue resolved by pull request 4288
[https://github.com/apache/arrow/pull/4288]

> Provide a flag to enable/disable null-checking in vectors' get methods
> --
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> For vector classes, the get method first checks if the value at the given 
> index is null. If it is not null, the method goes ahead to retrieve the 
> value. 
> For some scenarios, the first check is redundant, because the application 
> code has already checked the null, before calling the get method. This 
> redundant check may have non-trivial performance overheads. 
> So we add a flag to enable/disable the null checking, so the user can set the 
> flag according to their own specific scenario. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5309) [Python] Add clarifications to Python "append" methods that return new objects

2019-05-13 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5309:
---

 Summary: [Python] Add clarifications to Python "append" methods 
that return new objects
 Key: ARROW-5309
 URL: https://issues.apache.org/jira/browse/ARROW-5309
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Python
Reporter: Wes McKinney
 Fix For: 0.14.0


The current docstrings do say that an object is returned, but it is not clear 
in all cases that it is a new object and that the original object is left 
unmodified.

see example thread

https://github.com/apache/arrow/issues/4296



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5308) [Go] remove deprecated Feather format

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5308:
--
Labels: pull-request-available  (was: )

> [Go] remove deprecated Feather format
> -
>
> Key: ARROW-5308
> URL: https://issues.apache.org/jira/browse/ARROW-5308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Go
>Reporter: Sebastien Binet
>Priority: Major
>  Labels: pull-request-available
>
> We should probably consider removing the Feather format files from the Go 
> backend.
> Feather is deprecated and right now the Go implementation is just the result 
> of the automatically generated code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5308) [Go] remove deprecated Feather format

2019-05-13 Thread Sebastien Binet (JIRA)
Sebastien Binet created ARROW-5308:
--

 Summary: [Go] remove deprecated Feather format
 Key: ARROW-5308
 URL: https://issues.apache.org/jira/browse/ARROW-5308
 Project: Apache Arrow
  Issue Type: Bug
  Components: Go
Reporter: Sebastien Binet


We should probably consider removing the Feather format files from the Go 
backend.

Feather is deprecated and right now the Go implementation is just the result of 
the automatically generated code.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-5213) [Format] Script for updating various checked-in Flatbuffers files

2019-05-13 Thread Sebastien Binet (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838516#comment-16838516
 ] 

Sebastien Binet commented on ARROW-5213:


FYI, re-generating the Go files is "as simple as":

{{$> cd go/arrow}}

{{$> go run ./gen-flatbuffers.go}}

(but one needs to have a Go SDK available.)

> [Format] Script for updating various checked-in Flatbuffers files
> -
>
> Key: ARROW-5213
> URL: https://issues.apache.org/jira/browse/ARROW-5213
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Format, Go, Rust
>Reporter: Wes McKinney
>Assignee: Andy Grove
>Priority: Major
>
> Some subprojects have begun checking in generated Flatbuffers files to source 
> control. This presents a maintainability issue when there are additions or 
> changes made to the .fbs sources. It would be useful to be able to automate 
> the update of these files so it doesn't have to happen on a manual / 
> case-by-case basis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5306) [CI] [GLib] Disable GTK-Doc

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5306:
--
Labels: pull-request-available  (was: )

> [CI] [GLib] Disable GTK-Doc
> ---
>
> Key: ARROW-5306
> URL: https://issues.apache.org/jira/browse/ARROW-5306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>  Labels: pull-request-available
>
> Travis fails to process documents with GTK-Doc.
> [https://travis-ci.org/apache/arrow/jobs/531197944#L4170]
> This is caused by the recent GTK-Doc upgrade to 0.13.0, so disable GTK-Doc 
> until 0.13.1 is released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5306) [CI] [GLib] Disable GTK-Doc

2019-05-13 Thread Yosuke Shiro (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro updated ARROW-5306:

Issue Type: Bug  (was: New Feature)

> [CI] [GLib] Disable GTK-Doc
> ---
>
> Key: ARROW-5306
> URL: https://issues.apache.org/jira/browse/ARROW-5306
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Continuous Integration, GLib
>Reporter: Yosuke Shiro
>Assignee: Yosuke Shiro
>Priority: Major
>
> Travis fails to process documents with GTK-Doc.
> [https://travis-ci.org/apache/arrow/jobs/531197944#L4170]
> This is caused by the recent GTK-Doc upgrade to 0.13.0, so disable GTK-Doc 
> until 0.13.1 is released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5307) [CI] [GLib] Enable GTK-Doc

2019-05-13 Thread Yosuke Shiro (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro updated ARROW-5307:

Issue Type: Improvement  (was: New Feature)

> [CI] [GLib] Enable GTK-Doc
> --
>
> Key: ARROW-5307
> URL: https://issues.apache.org/jira/browse/ARROW-5307
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Continuous Integration, GLib
>Reporter: Yosuke Shiro
>Priority: Major
>
> Enable GTK-Doc when 0.13.1 is released.
> See https://issues.apache.org/jira/browse/ARROW-5306.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5307) [CI] [GLib] Enable GTK-Doc

2019-05-13 Thread Yosuke Shiro (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yosuke Shiro updated ARROW-5307:

Description: 
Enable GTK-Doc when 0.13.1 is released.

See https://issues.apache.org/jira/browse/ARROW-5306.

> [CI] [GLib] Enable GTK-Doc
> --
>
> Key: ARROW-5307
> URL: https://issues.apache.org/jira/browse/ARROW-5307
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Continuous Integration, GLib
>Reporter: Yosuke Shiro
>Priority: Major
>
> Enable GTK-Doc when 0.13.1 is released.
> See https://issues.apache.org/jira/browse/ARROW-5306.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5307) [CI] [GLib] Enable GTK-Doc

2019-05-13 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-5307:
---

 Summary: [CI] [GLib] Enable GTK-Doc
 Key: ARROW-5307
 URL: https://issues.apache.org/jira/browse/ARROW-5307
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Continuous Integration, GLib
Reporter: Yosuke Shiro






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5306) [CI] [GLib] Disable GTK-Doc

2019-05-13 Thread Yosuke Shiro (JIRA)
Yosuke Shiro created ARROW-5306:
---

 Summary: [CI] [GLib] Disable GTK-Doc
 Key: ARROW-5306
 URL: https://issues.apache.org/jira/browse/ARROW-5306
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Continuous Integration, GLib
Reporter: Yosuke Shiro
Assignee: Yosuke Shiro


Travis fails to process documents with GTK-Doc.
[https://travis-ci.org/apache/arrow/jobs/531197944#L4170]
This is caused by the recent GTK-Doc upgrade to 0.13.0, so disable GTK-Doc 
until 0.13.1 is released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-5290:

Attachment: (was: safe.png)

> Provide a flag to enable/disable null-checking in vectors' get methods
> --
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> For vector classes, the get method first checks if the value at the given 
> index is null. If it is not null, the method goes ahead to retrieve the 
> value. 
> For some scenarios, the first check is redundant, because the application 
> code has already checked the null, before calling the get method. This 
> redundant check may have non-trivial performance overheads. 
> So we add a flag to enable/disable the null checking, so the user can set the 
> flag according to their own specific scenario. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (ARROW-3424) [Python] Improved workflow for loading an arbitrary collection of Parquet files

2019-05-13 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-3424?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838442#comment-16838442
 ] 

Joris Van den Bossche commented on ARROW-3424:
--

Currently, a list of files is already supported in {{ParquetDataset}}. So 
something like this (that would address the SO question, I think) works:
 
{code:python}
dataset = pq.ParquetDataset(['part0.parquet', 'part1.parquet'])
dataset.read_pandas().to_pandas()
{code}

Do we think that is enough support? (If so, this issue can be closed, I think.) 
Or do we want to add this to {{pq.read_table}}? (which e.g. also accepts a 
directory name, which is then passed through to {{ParquetDataset}}; we could do 
a similar pass-through for a list of paths)


> [Python] Improved workflow for loading an arbitrary collection of Parquet 
> files
> ---
>
> Key: ARROW-3424
> URL: https://issues.apache.org/jira/browse/ARROW-3424
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Wes McKinney
>Priority: Major
>  Labels: parquet
> Fix For: 0.14.0
>
>
> See SO question for use case: 
> https://stackoverflow.com/questions/52613682/load-multiple-parquet-files-into-dataframe-for-analysis



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema

2019-05-13 Thread ASF GitHub Bot (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated ARROW-5286:
--
Labels: pull-request-available  (was: )

> [Python] support Structs in Table.from_pandas given a known schema
> --
>
> Key: ARROW-5286
> URL: https://issues.apache.org/jira/browse/ARROW-5286
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
>  Labels: pull-request-available
> Fix For: 0.14.0
>
>
> ARROW-2073 implemented creating a StructArray from an array of tuples (in 
> addition to from dicts). 
> This works in {{pyarrow.array}} (specifying the proper type):
> {code}
> In [2]: df = pd.DataFrame({'tuples': [(1, 2), (3, 4)]})   
>   
>   
> In [3]: struct_type = pa.struct([('a', pa.int64()), ('b', pa.int64())])   
>   
>   
> In [4]: pa.array(df['tuples'], type=struct_type)  
>   
>   
> Out[4]: 
> 
> -- is_valid: all not null
> -- child 0 type: int64
>   [
> 1,
> 3
>   ]
> -- child 1 type: int64
>   [
> 2,
> 4
>   ]
> {code}
> But does not yet work when converting a DataFrame to Table while specifying 
> the type in a schema:
> {code}
> In [5]: pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)])) 
>   
>   
> ---
> KeyError  Traceback (most recent call last)
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> get_logical_type(arrow_type)
>  68 try:
> ---> 69 return logical_type_map[arrow_type.id]
>  70 except KeyError:
> KeyError: 24
> During handling of the above exception, another exception occurred:
> NotImplementedError   Traceback (most recent call last)
>  in 
> > 1 pa.Table.from_pandas(df, schema=pa.schema([('tuples', struct_type)]))
> ~/scipy/repos/arrow/python/pyarrow/table.pxi in 
> pyarrow.lib.Table.from_pandas()
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> dataframe_to_arrays(df, schema, preserve_index, nthreads, columns, safe)
> 483 metadata = construct_metadata(df, column_names, index_columns,
> 484   index_descriptors, preserve_index,
> --> 485   types)
> 486 return all_names, arrays, metadata
> 487 
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in construct_metadata(df, 
> column_names, index_levels, index_descriptors, preserve_index, types)
> 207 metadata = get_column_metadata(df[col_name], 
> name=sanitized_name,
> 208arrow_type=arrow_type,
> --> 209field_name=sanitized_name)
> 210 column_metadata.append(metadata)
> 211 
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> get_column_metadata(column, name, arrow_type, field_name)
> 149 dict
> 150 """
> --> 151 logical_type = get_logical_type(arrow_type)
> 152 
> 153 string_dtype, extra_metadata = get_extension_dtype_info(column)
> ~/scipy/repos/arrow/python/pyarrow/pandas_compat.py in 
> get_logical_type(arrow_type)
>  77 elif isinstance(arrow_type, pa.lib.Decimal128Type):
>  78 return 'decimal'
> ---> 79 raise NotImplementedError(str(arrow_type))
>  80 
>  81 
> NotImplementedError: struct
> {code}





[jira] [Commented] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema

2019-05-13 Thread Joris Van den Bossche (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838389#comment-16838389
 ] 

Joris Van den Bossche commented on ARROW-5286:
--

Actually, also converting from dicts (without the need to specify the schema) 
shows the same limitation: it works in {{pa.array(..)}} but not in 
{{pa.Table.from_pandas(..)}}:

 
{code:java}
In [14]: df = pd.DataFrame({'dicts': [{'a': 1, 'b': 2}, {'a': 3, 'b': 4}]})

In [15]: pa.array(df['dicts']) 
Out[15]:

-- is_valid: all not null
-- child 0 type: int64
  [
    1,
    3
  ]
-- child 1 type: int64
  [
    2,
    4
  ]

In [16]: pa.Table.from_pandas(df)    
...
NotImplementedError: struct{code}
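Until {{Table.from_pandas}} supports structs, the row-to-columnar step it would need can be illustrated with a minimal pure-Python sketch (a hypothetical helper for illustration, not part of pyarrow): it splits tuples or dicts into per-field child arrays, matching the child layouts shown in the outputs above.

```python
def rows_to_struct_columns(rows, field_names):
    """Convert row-wise structs (tuples or dicts) into column-wise
    child arrays, the layout a StructArray stores internally."""
    columns = {name: [] for name in field_names}
    for row in rows:
        if isinstance(row, dict):
            # Missing keys become None (a null slot in the child array).
            for name in field_names:
                columns[name].append(row.get(name))
        else:
            # Tuples are matched to fields positionally.
            for name, value in zip(field_names, row):
                columns[name].append(value)
    return columns


# Mirrors the [(1, 2), (3, 4)] example: child 'a' is [1, 3], child 'b' is [2, 4].
print(rows_to_struct_columns([(1, 2), (3, 4)], ["a", "b"]))
```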

> [Python] support Structs in Table.from_pandas given a known schema
> --
>
> Key: ARROW-5286
> URL: https://issues.apache.org/jira/browse/ARROW-5286
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>





[jira] [Assigned] (ARROW-5286) [Python] support Structs in Table.from_pandas given a known schema

2019-05-13 Thread Joris Van den Bossche (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche reassigned ARROW-5286:


Assignee: Joris Van den Bossche

> [Python] support Structs in Table.from_pandas given a known schema
> --
>
> Key: ARROW-5286
> URL: https://issues.apache.org/jira/browse/ARROW-5286
> Project: Apache Arrow
>  Issue Type: Improvement
>  Components: Python
>Reporter: Joris Van den Bossche
>Assignee: Joris Van den Bossche
>Priority: Major
> Fix For: 0.14.0
>
>





[jira] [Issue Comment Deleted] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-5290:

Comment: was deleted

(was: The assembly code of the unsafe API)

> Provide a flag to enable/disable null-checking in vectors' get methods
> --
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Attachments: safe.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> For vector classes, the get method first checks whether the value at the 
> given index is null. If it is not null, the method goes on to retrieve the 
> value. 
> For some scenarios, the first check is redundant, because the application 
> code has already checked for null before calling the get method. This 
> redundant check may have a non-trivial performance overhead. 
> So we add a flag to enable/disable the null checking, so that users can set 
> the flag according to their own specific scenario. 
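The checked/unchecked get pattern described above can be sketched with a toy pure-Python vector (a hypothetical illustration of the idea, not Arrow's actual Java implementation): a class-level flag decides whether {{get}} re-validates nullness before reading the value.

```python
class IntVector:
    """Toy vector with a validity bitmap, mimicking the proposed
    checked/unchecked get pattern."""

    # Analogous to the proposed global flag; True keeps the safety check.
    NULL_CHECKING_ENABLED = True

    def __init__(self, values, validity):
        self._values = values        # stored values (garbage where null)
        self._validity = validity    # per-slot booleans: True = not null

    def is_null(self, index):
        return not self._validity[index]

    def get(self, index):
        # When the caller has already tested is_null(), this branch is
        # redundant work; disabling the flag skips it.
        if IntVector.NULL_CHECKING_ENABLED and self.is_null(index):
            raise ValueError(f"value at index {index} is null")
        return self._values[index]


v = IntVector([1, 0, 3], [True, False, True])
if not v.is_null(0):
    print(v.get(0))  # the null check inside get() repeats the test above
```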





[jira] [Issue Comment Deleted] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-5290:

Comment: was deleted

(was: The assembly code of the safe API (when the null-checking is disabled)
 !safe.png! )

> Provide a flag to enable/disable null-checking in vectors' get methods
> --
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Attachments: safe.png
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>





[jira] [Comment Edited] (ARROW-5224) [Java] Add APIs for supporting directly serialize/deserialize ValueVector

2019-05-13 Thread Ji Liu (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838299#comment-16838299
 ] 

Ji Liu edited comment on ARROW-5224 at 5/13/19 8:08 AM:


[~emkornfi...@gmail.com] Thanks for your reply.

For #2 you are right.

For #1, for example, if we encode Int or BigInt types using something like 
[https://en.wikipedia.org/wiki/LEB128], we need to read each value and 
reassemble the bytes, and the same applies to the deserialization process. Can 
this be achieved with the existing implementation? Besides, is compression 
supported?


was (Author: tianchen92):
[~emkornfi...@gmail.com] Thanks for your reply.

For #2 you are right.

For #1, for example, if we do encoding Int or BigInt type  like 
[https://en.wikipedia.org/wiki/LEB128], we need to read each value and 
reassemble byte, and the deserialize process as well. Can this be achieved by 
existing implementation?
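For reference, the LEB128 variable-length encoding mentioned above can be sketched in a few lines of pure Python (an illustration of the algorithm itself, not an Arrow API): each byte carries 7 payload bits, and the high bit marks whether more bytes follow.

```python
def leb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # low 7 bits of the remaining value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)         # final byte: high bit clear
            return bytes(out)


def leb128_decode(data: bytes) -> int:
    """Decode an unsigned LEB128 byte sequence back into an integer."""
    result = shift = 0
    for byte in data:
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):        # high bit clear: last byte
            break
        shift += 7
    return result


# Classic example from the LEB128 article: 624485 -> E5 8E 26.
print(leb128_encode(624485).hex())
```

Small values shrink to one or two bytes, which is where the space savings over a fixed-width Int/BigInt buffer would come from.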

> [Java] Add APIs for supporting directly serialize/deserialize ValueVector
> -
>
> Key: ARROW-5224
> URL: https://issues.apache.org/jira/browse/ARROW-5224
> Project: Apache Arrow
>  Issue Type: Improvement
>Reporter: Ji Liu
>Assignee: Ji Liu
>Priority: Minor
>  Labels: pull-request-available
>  Time Spent: 2h 20m
>  Remaining Estimate: 0h
>
> There is no API to directly serialize/deserialize a ValueVector. The only way 
> to implement this is to put a single FieldVector in a VectorSchemaRoot and 
> convert it to an ArrowRecordBatch, and the deserialization process works the 
> same way. Providing a utility class to implement this may be better. I know 
> all serializations should follow the IPC format so that data can be shared 
> between different Arrow implementations, but for users who only use the Java 
> API and want to do some further optimization, this seems to be no problem, and 
> we could provide them one more option.
> This may bring some benefits for Java users who only use ValueVector rather 
> than the IPC series of classes such as ArrowRecordBatch:
>  * We could do some shuffle optimizations such as compression and encoding 
> algorithms for numerical types, which could greatly improve performance.
>  * We could serialize/deserialize with the actual buffer size within the 
> vector, since the allocated buffer size is a power of 2, which is usually 
> bigger than what is really needed.
>  * We could reduce data conversions (VectorSchemaRoot, ArrowRecordBatch, etc.) 
> to make it user-friendly.
>  





[jira] [Commented] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Liya Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838319#comment-16838319
 ] 

Liya Fan commented on ARROW-5290:
-

The assembly code of the unsafe API

> Provide a flag to enable/disable null-checking in vectors' get methods
> --
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Attachments: safe.png
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>





[jira] [Commented] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Liya Fan (JIRA)


[ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838317#comment-16838317
 ] 

Liya Fan commented on ARROW-5290:
-

The assembly code of the safe API (when the null-checking is disabled)
 !safe.png! 

> Provide a flag to enable/disable null-checking in vectors' get methods
> --
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Attachments: safe.png
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>





[jira] [Updated] (ARROW-5290) Provide a flag to enable/disable null-checking in vectors' get methods

2019-05-13 Thread Liya Fan (JIRA)


 [ 
https://issues.apache.org/jira/browse/ARROW-5290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liya Fan updated ARROW-5290:

Attachment: safe.png

> Provide a flag to enable/disable null-checking in vectors' get methods
> --
>
> Key: ARROW-5290
> URL: https://issues.apache.org/jira/browse/ARROW-5290
> Project: Apache Arrow
>  Issue Type: New Feature
>  Components: Java
>Reporter: Liya Fan
>Assignee: Liya Fan
>Priority: Major
>  Labels: pull-request-available
> Attachments: safe.png
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>


