[jira] [Created] (ARROW-5411) Build error building on Mac OS Mojave

2019-05-23 Thread Miguel Cabrera (JIRA)
Miguel Cabrera created ARROW-5411:
-

 Summary: Build error building on Mac OS Mojave
 Key: ARROW-5411
 URL: https://issues.apache.org/jira/browse/ARROW-5411
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++, Python
 Environment: Mac OSX Mojave 10.14.5
Anaconda 4.6.14
XCode 10.2.1
CLANGXX=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++
CLANG=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang

Reporter: Miguel Cabrera


After following the instructions in the Python development documentation and 
the C++ building instructions, I get a linking error:

 
{code:java}
$ pwd
/Users/mcabrera/dev/arrow/cpp/release
$ cmake -DARROW_BUILD_TESTS=ON  ..

()

ld: warning: ignoring file 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd,
 file was built for unsupported file format ( 0x2D 0x2D 0x2D 0x20 0x21 0x74 
0x61 0x70 0x69 0x2D 0x74 0x62 0x64 0x2D 0x76 0x33 ) which is not the 
architecture being linked (x86_64): 
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/lib/libSystem.tbd
ld: dynamic main executables must link with libSystem.dylib for architecture 
x86_64
clang-4.0: error: linker command failed with exit code 1 (use -v to see 
invocation)
make[1]: *** [cmTC_510d0] Error 1
make: *** [cmTC_510d0/fast] Error 2

 {code}
I get the same issue if I follow the instructions in the Python development 
documentation:
{code:java}
 mkdir arrow/cpp/build
 pushd arrow/cpp/build
cmake -DCMAKE_INSTALL_PREFIX=$ARROW_HOME \
  -DCMAKE_INSTALL_LIBDIR=lib \
  -DARROW_FLIGHT=ON \
  -DARROW_GANDIVA=ON \
  -DARROW_ORC=ON \
  -DARROW_PARQUET=ON \
  -DARROW_PYTHON=ON \
  -DARROW_PLASMA=ON \
  -DARROW_BUILD_TESTS=ON \
  ..
{code}
The Python development documentation is not clear on whether the brew 
dependencies are necessary in order to build the Python (and C++) library, or 
whether the Anaconda packages alone are enough, so I installed them anyway. 
However, I get the same issue.
h2. Environment
{code:java}
Mac OSX Mojave 10.14.5
Anaconda 4.6.14
XCode 10.2.1
CLANGXX=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang++
CLANG=/Users/mcabrera/anaconda3/envs/pyarrow-dev/bin/x86_64-apple-darwin13.4.0-clang{code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5410) Crash at arrow::internal::FileWrite

2019-05-23 Thread Tham (JIRA)
Tham created ARROW-5410:
---

 Summary: Crash at arrow::internal::FileWrite
 Key: ARROW-5410
 URL: https://issues.apache.org/jira/browse/ARROW-5410
 Project: Apache Arrow
  Issue Type: Bug
 Environment: Windows version 10.0.14393.0 (rs1_release.160715-1616)

Reporter: Tham


My application writes a bunch of parquet files and it often crashes. Most of 
the time it crashes while writing the first file; sometimes it writes the first 
file and crashes on the second one. The file can always be opened. It only 
crashes at WriteTable.

From my testing, the application crashes when built in release mode, but does 
not crash in debug mode. It crashes only on one Windows machine, not on others.

Here is the stack trace from the dump file:
{code:java}
STACK_TEXT:  
001e`10efd840 7ffc`0333d53f : ` 001e`10efe230 
`0033 7ffc`032dbe21 : 
CortexSync!google_breakpad::ExceptionHandler::HandleInvalidParameter+0x1a0
001e`10efe170 7ffc`0333d559 : `ff02 7ffc`032da63d 
`0033 `0033 : ucrtbase!invalid_parameter+0x13f
001e`10efe1b0 7ffc`03318664 : 7ff7`7f7c8489 `ff02 
001e`10efe230 `0033 : ucrtbase!invalid_parameter_noinfo+0x9
001e`10efe1f0 7ffc`032d926d : ` `0140 
`0005 0122`bbe61e30 : 
ucrtbase!_acrt_uninitialize_command_line+0x6fd4
001e`10efe250 7ff7`7f66585e : 0010`0005 ` 
001e`10efe560 0122`b2337b88 : ucrtbase!write+0x8d
001e`10efe2a0 7ff7`7f632785 : 7ff7` 7ff7`7f7bb153 
0122`bbe890e0 001e`10efe634 : CortexSync!arrow::internal::FileWrite+0x5e
001e`10efe360 7ff7`7f632442 : `348a `0004 
733f`5e86f38c 0122`bbe14c40 : CortexSync!arrow::io::OSFile::Write+0x1d5
001e`10efe510 7ff7`7f71c1b9 : 001e`10efe738 7ff7`7f665522 
0122`bbffe6e0 ` : 
CortexSync!arrow::io::FileOutputStream::Write+0x12
001e`10efe540 7ff7`7f79cb2f : 0122`bbe61e30 0122`bbffe6e0 
`0013 001e`10efe730 : 
CortexSync!parquet::ArrowOutputStream::Write+0x39
001e`10efe6e0 7ff7`7f7abbaf : 7ff7`7fd75b78 7ff7`7fd75b78 
001e`10efe9c0 ` : 
CortexSync!parquet::ThriftSerializer::Serialize+0x11f
001e`10efe8c0 7ff7`7f7aaf93 : ` 0122`bbe3f450 
`0002 0122`bc0218d0 : 
CortexSync!parquet::SerializedPageWriter::WriteDictionaryPage+0x44f
001e`10efee20 7ff7`7f7a3707 : 0122`bbe3f450 001e`10eff250 
` 0122`b168 : 
CortexSync!parquet::TypedColumnWriterImpl 
>::WriteDictionaryPage+0x143
001e`10efeed0 7ff7`7f710480 : 001e`10eff1c0 ` 
0122`bbe3f540 0122`b2439998 : 
CortexSync!parquet::ColumnWriterImpl::Close+0x47
001e`10efef60 7ff7`7f7154da : 0122`bbec3cd0 001e`10eff1c0 
0122`bbec4bb0 0122`b2439998 : 
CortexSync!parquet::arrow::FileWriter::Impl::`vector deleting destructor'+0x100
001e`10efefa0 7ff7`7f71619c : ` 001e`10eff1c0 
0122`bbe89390 ` : 
CortexSync!parquet::arrow::FileWriter::Impl::WriteColumnChunk+0x6fa
001e`10eff150 7ff7`7f202de9 : `0001 001e`10eff430 
`000f ` : 
CortexSync!parquet::arrow::FileWriter::WriteTable+0x6cc
001e`10eff410 7ff7`7f18baf3 : 0122`bbec39b0 0122`b24c53f8 
`3f80 ` : 
CortexSync!Cortex::Storage::ParquetStreamWriter::writeRowGroup+0x49{code}
I tried a lot of ways to find the root cause, but failed. Can anyone here give 
me some information or advice so that I can investigate further? Thanks!



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [Discuss][Format][Java] Finalizing Union Types

2019-05-23 Thread Micah Kornfield
I'd like to bump this thread, to see if anyone has any comments.  If nobody
objects I will try to start implementing the changes next week.

Thanks,
Micah
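
For concreteness, below is a rough Python/pyarrow sketch of what option 2 from
the quoted proposal allows (a dense union whose children include two fields of
the same type). It is only illustrative, does not imply anything about the Java
API, and argument names may differ slightly across pyarrow versions:

import pyarrow as pa

# Two int32 children -- option 1 would forbid this, option 2 allows it.
types = pa.array([0, 1, 0], type=pa.int8())       # which child each slot uses
offsets = pa.array([0, 0, 1], type=pa.int32())    # index into that child
children = [pa.array([10, 20], type=pa.int32()),
            pa.array([30], type=pa.int32())]
union = pa.UnionArray.from_dense(types, offsets, children,
                                 field_names=['a', 'b'])
print(union.type)  # a dense union type carrying two int32-typed fields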

On Mon, May 20, 2019 at 9:37 PM Micah Kornfield 
wrote:

> In the past [1] there hasn't been agreement on the final requirements for
> union types.
>
> Briefly the two approaches that are currently advocated:
> 1.  Limit unions to only contain one field of each individual type (e.g.
> you can't have two separate int32 fields).  Java takes this approach.
> 2.  Generalized unions (unions can have any number of fields with the same
> type).  C++ takes this approach.
>
> There was a prior PR [2] that stalled in trying to take this approach with
> Java.  For writing vectors it seemed to be slower on a benchmark.
>
> My proposal:  We should pursue option 2 (the general approach).  There are
> already data interchange formats that support it and it would be nice to have
> a data model that makes the translation to and from Arrow schemas easy:
> 1.  Avro seems to support it [3] (with the exception of complex types).
> 2.  Protobufs loosely support it [4] via oneof.
>
> In order to address the issues in [2], I propose making the following
> changes/additions to the Java implementation:
> 1.  Keep the default write path untouched with the existing class.
> 2.  Add a new sparse union class that implements the same interface, which
> can be used on the read path and, if a client opts in, via direct
> construction.
> 3.  Add a dense union class (I don't believe Java has one).
>
> I'm still ramping up the Java code base, so I'd like other Java
> contributors to chime in to see if this plan sounds feasible and acceptable.
>
> Any other thoughts on Unions?
>
> Thanks,
> Micah
>
> [1]
> https://lists.apache.org/thread.html/82ec2049fc3c29de232c9c6962aaee9ec022d581cecb6cf0eb6a8f36@%3Cdev.arrow.apache.org%3E
> [2] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> [3] https://github.com/apache/arrow/pull/987#issuecomment-493231493
> [4] https://developers.google.com/protocol-buffers/docs/proto#oneof
>


[jira] [Created] (ARROW-5409) [C++] Improvement for IsIn Kernel when right array is small

2019-05-23 Thread Preeti Suman (JIRA)
Preeti Suman created ARROW-5409:
---

 Summary: [C++] Improvement for IsIn Kernel when right array is 
small
 Key: ARROW-5409
 URL: https://issues.apache.org/jira/browse/ARROW-5409
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Preeti Suman


The core of the algorithm (in Python) is:
{code:java}
for idx, elem in enumerate(array):
  output[idx] = (elem in memo_table)
{code}
Often the right operand list will be very small; in this case, the hash table 
should be replaced with a constant vector. 
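
A rough illustrative sketch (plain Python, not the actual kernel code; the size 
threshold is made up) of the proposed special case:
{code:java}
SMALL_THRESHOLD = 8  # illustrative cutoff, not a measured constant

def is_in(array, right_values):
    # Mark which elements of `array` appear in `right_values`.
    if len(right_values) <= SMALL_THRESHOLD:
        # Small right operand: scan a constant tuple instead of building
        # and probing a hash table for every element.
        lookup = tuple(right_values)
        return [elem in lookup for elem in array]
    # General case: memoize the right operand in a hash table.
    memo_table = set(right_values)
    return [elem in memo_table for elem in array]

print(is_in([1, 2, 5, 7], [2, 7]))  # [False, True, False, True]
{code}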



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5408) [Rust] Create struct array builder that creates null buffers

2019-05-23 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5408:
-

 Summary: [Rust] Create struct array builder that creates null 
buffers
 Key: ARROW-5408
 URL: https://issues.apache.org/jira/browse/ARROW-5408
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Rust
Reporter: Neville Dipale


We currently have a way of creating a struct array from a list of (field, 
array) tuples. This does not create null buffers for the struct (because no 
index is null). While this works fine for Rust, it often leads to data that is 
incompatible with IPC data and kernel function outputs.

Having a function that caters for nulls, or expanding the current one, would 
alleviate this issue.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Python development setup and LLVM 7 / Gandiva

2019-05-23 Thread John Muehlhausen
Not sure why cmake isn't happy (as in original post).  Environment is set
up as per instructions:

(pyarrow-dev) JGM-KTG-Mac-Mini:python jmuehlhausen$ conda list llvmdev
# packages in environment at
/Users/jmuehlhausen/miniconda3/envs/pyarrow-dev:
#
# Name                    Version                   Build  Channel
llvmdev                   7.0.0             h04f5b5a_1000    conda-forge

On Thu, May 23, 2019 at 1:46 PM Wes McKinney  wrote:

> llvmdev=7 is in the conda_env_cpp.yml requirements file, are you using
> something else?
>
> https://github.com/apache/arrow/blob/master/ci/conda_env_cpp.yml#L31
>
> On Thu, May 23, 2019 at 12:53 PM John Muehlhausen  wrote:
> >
> > The pyarrow-dev conda environment does not include llvm 7, which appears
> to
> > be a requirement for Gandiva.
> >
> > So I'm just trying to figure out a pain-free way to add llvm 7 in a way
> > that cmake can find it, for Mac.
> >
> > I had already solved the other Mac problem with
> > export CONDA_BUILD_SYSROOT=/Users/jmuehlhausen/sdks/MacOSX10.9.sdk
> >
> > On Wed, May 22, 2019 at 1:46 PM Wes McKinney 
> wrote:
> >
> > > hi John,
> > >
> > > Some changes were just made to address the issue you are having, see
> > > the latest instructions at
> > >
> > >
> > >
> https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst
> > >
> > > Let us know if that does not work.
> > >
> > > - Wes
> > >
> > > On Wed, May 22, 2019 at 11:02 AM John Muehlhausen  wrote:
> > > >
> > > > Set up pyarrow-dev conda environment as at
> > > > https://arrow.apache.org/docs/developers/python.html
> > > >
> > > > Got the following error.  I will disable Gandiva for now but I'd
> like to
> > > > get it back at some point.  I'm on Mac OS 10.13.6.
> > > >
> > > > CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
> > > >   Could not find a configuration file for package "LLVM" that is
> > > compatible
> > > >   with requested version "7.0".
> > > >
> > > >   The following configuration files were considered but not accepted:
> > > >
> > > >
> > > >
> > >
> /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/LLVMConfig.cmake,
> > > > version: 4.0.1
> > > >
> > > >
> > >
> /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/llvm-config.cmake,
> > > > version: unknown
> > > >
> > > > Call Stack (most recent call first):
> > > >   src/gandiva/CMakeLists.txt:31 (find_package)
> > >
>


Re: Python development setup and LLVM 7 / Gandiva

2019-05-23 Thread Wes McKinney
llvmdev=7 is in the conda_env_cpp.yml requirements file, are you using
something else?

https://github.com/apache/arrow/blob/master/ci/conda_env_cpp.yml#L31

On Thu, May 23, 2019 at 12:53 PM John Muehlhausen  wrote:
>
> The pyarrow-dev conda environment does not include llvm 7, which appears to
> be a requirement for Gandiva.
>
> So I'm just trying to figure out a pain-free way to add llvm 7 in a way
> that cmake can find it, for Mac.
>
> I had already solved the other Mac problem with
> export CONDA_BUILD_SYSROOT=/Users/jmuehlhausen/sdks/MacOSX10.9.sdk
>
> On Wed, May 22, 2019 at 1:46 PM Wes McKinney  wrote:
>
> > hi John,
> >
> > Some changes were just made to address the issue you are having, see
> > the latest instructions at
> >
> >
> > https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst
> >
> > Let us know if that does not work.
> >
> > - Wes
> >
> > On Wed, May 22, 2019 at 11:02 AM John Muehlhausen  wrote:
> > >
> > > Set up pyarrow-dev conda environment as at
> > > https://arrow.apache.org/docs/developers/python.html
> > >
> > > Got the following error.  I will disable Gandiva for now but I'd like to
> > > get it back at some point.  I'm on Mac OS 10.13.6.
> > >
> > > CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
> > >   Could not find a configuration file for package "LLVM" that is
> > compatible
> > >   with requested version "7.0".
> > >
> > >   The following configuration files were considered but not accepted:
> > >
> > >
> > >
> > /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/LLVMConfig.cmake,
> > > version: 4.0.1
> > >
> > >
> > /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/llvm-config.cmake,
> > > version: unknown
> > >
> > > Call Stack (most recent call first):
> > >   src/gandiva/CMakeLists.txt:31 (find_package)
> >


Re: Python development setup and LLVM 7 / Gandiva

2019-05-23 Thread John Muehlhausen
The pyarrow-dev conda environment does not include llvm 7, which appears to
be a requirement for Gandiva.

So I'm just trying to figure out a pain-free way to add llvm 7 in a way
that cmake can find it, for Mac.

I had already solved the other Mac problem with
export CONDA_BUILD_SYSROOT=/Users/jmuehlhausen/sdks/MacOSX10.9.sdk

On Wed, May 22, 2019 at 1:46 PM Wes McKinney  wrote:

> hi John,
>
> Some changes were just made to address the issue you are having, see
> the latest instructions at
>
>
> https://github.com/apache/arrow/blob/master/docs/source/developers/python.rst
>
> Let us know if that does not work.
>
> - Wes
>
> On Wed, May 22, 2019 at 11:02 AM John Muehlhausen  wrote:
> >
> > Set up pyarrow-dev conda environment as at
> > https://arrow.apache.org/docs/developers/python.html
> >
> > Got the following error.  I will disable Gandiva for now but I'd like to
> > get it back at some point.  I'm on Mac OS 10.13.6.
> >
> > CMake Error at cmake_modules/FindLLVM.cmake:33 (find_package):
> >   Could not find a configuration file for package "LLVM" that is
> compatible
> >   with requested version "7.0".
> >
> >   The following configuration files were considered but not accepted:
> >
> >
> >
> /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/LLVMConfig.cmake,
> > version: 4.0.1
> >
> >
> /Users/jmuehlhausen/miniconda3/envs/pyarrow-dev/lib/cmake/llvm/llvm-config.cmake,
> > version: unknown
> >
> > Call Stack (most recent call first):
> >   src/gandiva/CMakeLists.txt:31 (find_package)
>


Re: Java/Scala: efficient reading of Parquet into Arrow?

2019-05-23 Thread Wes McKinney
Cool. At some point we are interested in having simple compressed
(e.g. with LZ4 or ZSTD) record batches natively in the Arrow protocol,
see

https://issues.apache.org/jira/browse/ARROW-300

On Thu, May 23, 2019 at 10:21 AM Joris Peeters
 wrote:
>
> Cool, thanks. I think we'll just go with reading LZ4 compressed Arrow
> directly from disk then, and by-pass Parquet altogether.
> The compressed Arrow files are about 20% larger than the PQ files, but
> getting it into some useful form in memory is almost on par with pandas.
>
> At the moment, I don't need the additional benefits parquet would give
> (various forms of filtering), so this is perfectly OK. And now I don't have
> half of Hadoop in my dependencies. \o/
> But yes, definitely +1 on a self-contained vectorised Parquet reader.
>
> -J
>
>
> On Thu, May 23, 2019 at 2:58 PM Wes McKinney  wrote:
>
> > hi Joris,
> >
> > The Apache Parquet mailing list is
> >
> > d...@parquet.apache.org
> >
> > I'm copying the list here
> >
> > AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
> > otherwise). There are some vectorized Java-based readers in the wild:
> > in Dremio [1] and Apache Spark, at least. I'm interested to see a
> > reusable library that supports vectorized Arrow reads in Java.
> >
> > - Wes
> >
> > [1]: https://github.com/dremio/dremio-oss
> >
> > On Thu, May 23, 2019 at 8:54 AM Joris Peeters
> >  wrote:
> > >
> > > Hello,
> > >
> > > I'm trying to read a Parquet file from disk into Arrow in memory, in
> > Scala.
> > > I'm wondering what the most efficient approach is, especially for the
> > > reading part. I'm aware that Parquet reading is perhaps beyond the scope
> > of
> > > this mailing list but,
> > >
> > > - I believe Arrow and Parquet are closely intertwined these days?
> > > - I can't find an appropriate Parquet mailing list.
> > >
> > > Any pointers would be appreciated!
> > >
> > > Below is the code I currently have. My concern is that this alone already
> > > takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> > > ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> > > way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> > > various types, and takes 15MB on disk.
> > >
> > > import org.apache.hadoop.conf.Configuration
> > > import org.apache.parquet.column.ColumnDescriptor
> > > import
> > org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> > > import org.apache.parquet.format.converter.ParquetMetadataConverter
> > > import org.apache.parquet.hadoop.{ParquetFileReader}
> > > import org.apache.parquet.io.ColumnIOFactory
> > >
> > > ...
> > >
> > > val path: Path = Paths.get("C:\\item.pq")
> > > val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> > > val conf = new Configuration()
> > >
> > > val readFooter = ParquetFileReader.readFooter(conf, jpath,
> > > ParquetMetadataConverter.NO_FILTER)
> > > val schema = readFooter.getFileMetaData.getSchema
> > > val r = ParquetFileReader.open(conf, jpath)
> > >
> > > val pages = r.readNextRowGroup()
> > > val rows = pages.getRowCount
> > >
> > > val columnIO = new ColumnIOFactory().getColumnIO(schema)
> > > val recordReader = columnIO.getRecordReader(pages, new
> > > GroupRecordConverter(schema))
> > >
> > > // This takes about 2s
> > > (1 to rows.toInt).map { i =>
> > >   val group = recordReader.read
> > >   // Just read first column for now ...
> > >   val x = group.getLong(0,0)
> > > }
> > >
> > > ...
> > >
> > > As this will be in the hot path of my code, I'm quite keen to make it
> > > as fast as possible. Note that the eventual objective is to build
> > > Arrow data. I was assuming there would be a way to quickly load the
> > > columns. I suspect the loop over the rows, building row-based records,
> > > is causing a lot of overhead, but can't seem to find another way.
> > >
> > >
> > > Thanks,
> > >
> > > -J
> >


Re: [Python] Is there a way to specify a column as non-nullable with parquet.write_table?

2019-05-23 Thread Wes McKinney
Yes, but you will need to resolve

https://issues.apache.org/jira/browse/ARROW-5169

write_table should respect the field-level nullability in the schema
of the Table you pass
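
A minimal pyarrow sketch of what that looks like (file name illustrative);
whether the REQUIRED repetition actually ends up in the Parquet file is exactly
what ARROW-5169 tracks:

import pyarrow as pa
import pyarrow.parquet as pq

# Declare the field as non-nullable in the Arrow schema.
schema = pa.schema([pa.field("id", pa.int64(), nullable=False)])
table = pa.Table.from_arrays([pa.array([1, 2, 3], type=pa.int64())],
                             schema=schema)

# write_table takes nullability from the table's schema; until ARROW-5169
# is resolved it may not be propagated into the Parquet file metadata.
pq.write_table(table, "required_field.parquet")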

On Thu, May 23, 2019 at 10:34 AM Tim Swast  wrote:
>
> I'm currently using parquet as the intermediate format when uploading a
> pandas DataFrame to Google BigQuery. We encounter a problem when trying to
> append a parquet file to a table with required fields (issue:
> https://github.com/googleapis/google-cloud-python/issues/8093).
>
> Is there a way to mark fields as required / non-nullable in parquet files?
> If there is, is there a way to set that option with
> pyarrow.parquet.write_table?
>
> https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
>
> •  Tim Swast
> •  Software Friendliness Engineer
> •  Google Cloud Developer Relations
> •  Seattle, WA, USA


[Python] Is there a way to specify a column as non-nullable with parquet.write_table?

2019-05-23 Thread Tim Swast
I'm currently using parquet as the intermediate format when uploading a
pandas DataFrame to Google BigQuery. We encounter a problem when trying to
append a parquet file to a table with required fields (issue:
https://github.com/googleapis/google-cloud-python/issues/8093).

Is there a way to mark fields as required / non-nullable in parquet files?
If there is, is there a way to set that option with
pyarrow.parquet.write_table?

https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table

•  Tim Swast
•  Software Friendliness Engineer
•  Google Cloud Developer Relations
•  Seattle, WA, USA


[jira] [Created] (ARROW-5407) [C++] Integration test Travis CI entry builds many unnecessary targets

2019-05-23 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5407:
---

 Summary: [C++] Integration test Travis CI entry builds many 
unnecessary targets
 Key: ARROW-5407
 URL: https://issues.apache.org/jira/browse/ARROW-5407
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Wes McKinney
 Fix For: 0.14.0


Only the IPC and Flight integration test targets are needed to run the tests. 
It appears that all targets, including all unit tests, are being built on Travis.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Java/Scala: efficient reading of Parquet into Arrow?

2019-05-23 Thread Joris Peeters
Cool, thanks. I think we'll just go with reading LZ4 compressed Arrow
directly from disk then, and by-pass Parquet altogether.
The compressed Arrow files are about 20% larger than the PQ files, but
getting it into some useful form in memory is almost on par with pandas.

At the moment, I don't need the additional benefits parquet would give
(various forms of filtering), so this is perfectly OK. And now I don't have
half of Hadoop in my dependencies. \o/
But yes, definitely +1 on a self-contained vectorised Parquet reader.

-J


On Thu, May 23, 2019 at 2:58 PM Wes McKinney  wrote:

> hi Joris,
>
> The Apache Parquet mailing list is
>
> d...@parquet.apache.org
>
> I'm copying the list here
>
> AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
> otherwise). There are some vectorized Java-based readers in the wild:
> in Dremio [1] and Apache Spark, at least. I'm interested to see a
> reusable library that supports vectorized Arrow reads in Java.
>
> - Wes
>
> [1]: https://github.com/dremio/dremio-oss
>
> On Thu, May 23, 2019 at 8:54 AM Joris Peeters
>  wrote:
> >
> > Hello,
> >
> > I'm trying to read a Parquet file from disk into Arrow in memory, in
> Scala.
> > I'm wondering what the most efficient approach is, especially for the
> > reading part. I'm aware that Parquet reading is perhaps beyond the scope
> of
> > this mailing list but,
> >
> > - I believe Arrow and Parquet are closely intertwined these days?
> > - I can't find an appropriate Parquet mailing list.
> >
> > Any pointers would be appreciated!
> >
> > Below is the code I currently have. My concern is that this alone already
> > takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> > ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> > way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> > various types, and takes 15MB on disk.
> >
> > import org.apache.hadoop.conf.Configuration
> > import org.apache.parquet.column.ColumnDescriptor
> > import
> org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> > import org.apache.parquet.format.converter.ParquetMetadataConverter
> > import org.apache.parquet.hadoop.{ParquetFileReader}
> > import org.apache.parquet.io.ColumnIOFactory
> >
> > ...
> >
> > val path: Path = Paths.get("C:\\item.pq")
> > val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> > val conf = new Configuration()
> >
> > val readFooter = ParquetFileReader.readFooter(conf, jpath,
> > ParquetMetadataConverter.NO_FILTER)
> > val schema = readFooter.getFileMetaData.getSchema
> > val r = ParquetFileReader.open(conf, jpath)
> >
> > val pages = r.readNextRowGroup()
> > val rows = pages.getRowCount
> >
> > val columnIO = new ColumnIOFactory().getColumnIO(schema)
> > val recordReader = columnIO.getRecordReader(pages, new
> > GroupRecordConverter(schema))
> >
> > // This takes about 2s
> > (1 to rows.toInt).map { i =>
> >   val group = recordReader.read
> >   // Just read first column for now ...
> >   val x = group.getLong(0,0)
> > }
> >
> > ...
> >
> > As this will be in the hot path of my code, I'm quite keen to make it
> > as fast as possible. Note that the eventual objective is to build
> > Arrow data. I was assuming there would be a way to quickly load the
> > columns. I suspect the loop over the rows, building row-based records,
> > is causing a lot of overhead, but can't seem to find another way.
> >
> >
> > Thanks,
> >
> > -J
>


[jira] [Created] (ARROW-5406) enable Subscribe and GetNotification from Java

2019-05-23 Thread Tim Emerick (JIRA)
Tim Emerick created ARROW-5406:
--

 Summary: enable Subscribe and GetNotification from Java
 Key: ARROW-5406
 URL: https://issues.apache.org/jira/browse/ARROW-5406
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++ - Plasma, Java
Reporter: Tim Emerick


Currently, these functions exist in the cpp API, but are not exposed via JNI.

If this is a feature that is in line with the project direction, I would be 
happy to implement it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5405) [Documentation] Move integration testing documentation to Sphinx docs, add instructions for JavaScript

2019-05-23 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-5405:
---

 Summary: [Documentation] Move integration testing documentation to 
Sphinx docs, add instructions for JavaScript
 Key: ARROW-5405
 URL: https://issues.apache.org/jira/browse/ARROW-5405
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Documentation
Reporter: Wes McKinney
 Fix For: 0.14.0


I noticed that the JavaScript information is not in integration/README.md. It 
would be a good opportunity to migrate this over to the docs/source/developers 
directory.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: Java/Scala: efficient reading of Parquet into Arrow?

2019-05-23 Thread Wes McKinney
hi Joris,

The Apache Parquet mailing list is

d...@parquet.apache.org

I'm copying the list here

AFAIK parquet-mr doesn't feature vectorized reading (for Arrow or
otherwise). There are some vectorized Java-based readers in the wild:
in Dremio [1] and Apache Spark, at least. I'm interested to see a
reusable library that supports vectorized Arrow reads in Java.

- Wes

[1]: https://github.com/dremio/dremio-oss

On Thu, May 23, 2019 at 8:54 AM Joris Peeters
 wrote:
>
> Hello,
>
> I'm trying to read a Parquet file from disk into Arrow in memory, in Scala.
> I'm wondering what the most efficient approach is, especially for the
> reading part. I'm aware that Parquet reading is perhaps beyond the scope of
> this mailing list but,
>
> - I believe Arrow and Parquet are closely intertwined these days?
> - I can't find an appropriate Parquet mailing list.
>
> Any pointers would be appreciated!
>
> Below is the code I currently have. My concern is that this alone already
> takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
> ~=100ms in Python. So I suspect I'm not doing this in the most efficient
> way possible ... The Parquet data holds 1570150 rows, with 14 columns of
> various types, and takes 15MB on disk.
>
> import org.apache.hadoop.conf.Configuration
> import org.apache.parquet.column.ColumnDescriptor
> import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
> import org.apache.parquet.format.converter.ParquetMetadataConverter
> import org.apache.parquet.hadoop.{ParquetFileReader}
> import org.apache.parquet.io.ColumnIOFactory
>
> ...
>
> val path: Path = Paths.get("C:\\item.pq")
> val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
> val conf = new Configuration()
>
> val readFooter = ParquetFileReader.readFooter(conf, jpath,
> ParquetMetadataConverter.NO_FILTER)
> val schema = readFooter.getFileMetaData.getSchema
> val r = ParquetFileReader.open(conf, jpath)
>
> val pages = r.readNextRowGroup()
> val rows = pages.getRowCount
>
> val columnIO = new ColumnIOFactory().getColumnIO(schema)
> val recordReader = columnIO.getRecordReader(pages, new
> GroupRecordConverter(schema))
>
> // This takes about 2s
> (1 to rows.toInt).map { i =>
>   val group = recordReader.read
>   // Just read first column for now ...
>   val x = group.getLong(0,0)
> }
>
> ...
>
> As this will be in the hot path of my code, I'm quite keen to make it
> as fast as possible. Note that the eventual objective is to build
> Arrow data. I was assuming there would be a way to quickly load the
> columns. I suspect the loop over the rows, building row-based records,
> is causing a lot of overhead, but can't seem to find another way.
>
>
> Thanks,
>
> -J


Java/Scala: efficient reading of Parquet into Arrow?

2019-05-23 Thread Joris Peeters
Hello,

I'm trying to read a Parquet file from disk into Arrow in memory, in Scala.
I'm wondering what the most efficient approach is, especially for the
reading part. I'm aware that Parquet reading is perhaps beyond the scope of
this mailing list but,

- I believe Arrow and Parquet are closely intertwined these days?
- I can't find an appropriate Parquet mailing list.

Any pointers would be appreciated!

Below is the code I currently have. My concern is that this alone already
takes about 2s, whereas "pq.read_pandas(the_file_path).to_pandas()" takes
~=100ms in Python. So I suspect I'm not doing this in the most efficient
way possible ... The Parquet data holds 1570150 rows, with 14 columns of
various types, and takes 15MB on disk.

import org.apache.hadoop.conf.Configuration
import org.apache.parquet.column.ColumnDescriptor
import org.apache.parquet.example.data.simple.convert.GroupRecordConverter
import org.apache.parquet.format.converter.ParquetMetadataConverter
import org.apache.parquet.hadoop.{ParquetFileReader}
import org.apache.parquet.io.ColumnIOFactory

...

val path: Path = Paths.get("C:\\item.pq")
val jpath = new org.apache.hadoop.fs.Path(path.toFile.getAbsolutePath)
val conf = new Configuration()

val readFooter = ParquetFileReader.readFooter(conf, jpath,
ParquetMetadataConverter.NO_FILTER)
val schema = readFooter.getFileMetaData.getSchema
val r = ParquetFileReader.open(conf, jpath)

val pages = r.readNextRowGroup()
val rows = pages.getRowCount

val columnIO = new ColumnIOFactory().getColumnIO(schema)
val recordReader = columnIO.getRecordReader(pages, new
GroupRecordConverter(schema))

// This takes about 2s
(1 to rows.toInt).map { i =>
  val group = recordReader.read
  // Just read first column for now ...
  val x = group.getLong(0,0)
}

...

As this will be in the hot path of my code, I'm quite keen to make it
as fast as possible. Note that the eventual objective is to build
Arrow data. I was assuming there would be a way to quickly load the
columns. I suspect the loop over the rows, building row-based records,
is causing a lot of overhead, but can't seem to find another way.
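
For comparison, a short pyarrow sketch of the columnar read referenced above
(path illustrative); it goes through the native reader and never materializes
row objects:

import pyarrow.parquet as pq

table = pq.read_table("item.pq")          # Arrow Table, built column by column
print(table.num_rows, table.num_columns)
first_column = table.column(0)            # columnar access, no row records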


Thanks,

-J


Re: memory mapped IPC File of RecordBatches?

2019-05-23 Thread Wes McKinney
OK. Can you open a JIRA about fixing this? I don't recall the
rationale for using MAP_PRIVATE to begin with, and since the behavior
is unspecified on Linux it would be better to be consistent across
platforms
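
For reference, a minimal Python sketch of the zero-copy memory-mapped read
being discussed (it assumes a stream-format IPC file at /tmp/test.batch, as in
the thread; whether later changes to the file show up through the batch is the
MAP_PRIVATE vs MAP_SHARED question above):

import pyarrow as pa

source = pa.memory_map('/tmp/test.batch', 'r')   # map the file, no copy
reader = pa.ipc.open_stream(source)
batch = reader.read_next_batch()                 # buffers point into the map
print(batch.to_pandas())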

On Wed, May 22, 2019 at 11:02 PM John Muehlhausen  wrote:
>
> Well, it works fine on Linux... and the Linux mmap man page seems to
> indicate you are right about MAP_PRIVATE:
>
> "It is unspecified whether changes made to the file after the mmap() call
> are visible in the mapped region."
>
> The Mac man page has no such note.
>
> Changing it to MAP_SHARED makes it work as expected on MacOS.  Still odd
> that the changes are only sometimes visible ... but I guess that is
> compatible with it being "unspecified."
>
> -John
>
> On Wed, May 22, 2019 at 8:56 PM John Muehlhausen  wrote:
>
> > I'll mess with this on various platforms and report back.  Thanks
> >
> > On Wed, May 22, 2019 at 8:42 PM Wes McKinney  wrote:
> >
> >> I tried locally and am not seeing this behavior
> >>
> >> In [10]: source = pa.memory_map('/tmp/test.batch')
> >>
> >> In [11]: reader=pa.ipc.open_stream(source)
> >>
> >> In [12]: batch = reader.get_next_batch()
> >> /home/wesm/miniconda/envs/arrow-3.7/bin/ipython:1: FutureWarning:
> >> Please use read_next_batch instead of get_next_batch
> >>   #!/home/wesm/miniconda/envs/arrow-3.7/bin/python
> >>
> >> In [13]: batch.to_pandas()
> >> Out[13]:
> >>field1
> >> 0 1.0
> >> 1 NaN
> >>
> >> Now ran dd to overwrite the file contents
> >>
> >> In [14]: batch.to_pandas()
> >> Out[14]:
> >> field1
> >> 0  NaN
> >> 1 -245785081.0
> >>
> >> On Wed, May 22, 2019 at 8:34 PM John Muehlhausen  wrote:
> >> >
> >> > I don't think that is it.  I changed my mmap to MAP_PRIVATE in the first
> >> > raw mmap test and the dd changes are still visible.  I also changed to
> >> > storing the stream format instead of the file format and got the same
> >> > result.
> >> >
> >> > Where is the code that constructs a buffer/array by pointing it into the
> >> > mmap space instead of by allocating space?  Sorry I'm so confused about
> >> > this, I just don't see how it is supposed to work.
> >> >
> >> > On Wed, May 22, 2019 at 7:58 PM Wes McKinney 
> >> wrote:
> >> >
> >> > > It seems this could be due to our use of MAP_PRIVATE for read-only
> >> memory
> >> > > maps
> >> > >
> >> > >
> >> https://github.com/apache/arrow/blob/master/cpp/src/arrow/io/file.cc#L393
> >> > >
> >> > > Some more investigation would be required
> >> > >
> >> > > On Wed, May 22, 2019 at 7:43 PM John Muehlhausen  wrote:
> >> > > >
> >> > > > Is there an example somewhere of referring to the RecordBatch data
> >> in a
> >> > > memory-mapped IPC File in a zero-copy manner?
> >> > > >
> >> > > > I tried to do this in Python and must be doing something wrong.  (I
> >> > > don't really care whether the example is Python or C++)
> >> > > >
> >> > > > In the attached test, when I get to the first prompt and hit
> >> return, I
> >> > > get the same content again.  Likewise when I hit return on the second
> >> > > prompt I get the same content again.
> >> > > >
> >> > > > However, if before hitting return on the first prompt I issue:
> >> > > >
> >> > > > dd conv=notrunc if=/dev/urandom of=/tmp/test.batch bs=478 count=1
> >> > > >
> >> > > >
> >> > > > i.e. overwrite the contents of the file, I get a garbled result.
> >> > > (Replace 478 with the size of your file.)
> >> > > >
> >> > > > However, if I wait until the second prompt to issue the dd command
> >> > > before hitting return, I do not get an error.  Instead,
> >> batch.to_pandas()
> >> > > works the same both before and after the data is overwritten.  This
> >> was not
> >> > > expected as I thought that the batch object was looking at the file
> >> > > in-place, i.e. zero-copy?
> >> > > >
> >> > > > Am I tying together the memory-mapping and the batch construction
> >> in the
> >> > > wrong way?
> >> > > >
> >> > > > Thanks,
> >> > > > John
> >> > >
> >>
> >


[jira] [Created] (ARROW-5404) [C++] nonstd::string_view conflicts with std::string_view in c++17

2019-05-23 Thread Benjamin Kietzman (JIRA)
Benjamin Kietzman created ARROW-5404:


 Summary: [C++] nonstd::string_view conflicts with std::string_view 
in c++17
 Key: ARROW-5404
 URL: https://issues.apache.org/jira/browse/ARROW-5404
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Benjamin Kietzman
Assignee: Benjamin Kietzman


From GitHub issue https://github.com/apache/arrow/issues/4294

Our vendored string_view header will forward to {{std::string_view}} if that is 
available. This can produce ABI conflicts and build errors when Arrow or 
applications that use it are built against C++17.

I think it's acceptable to just force usage of nonstd's implementation with 
{{#define nssv_CONFIG_SELECT_STRING_VIEW nssv_STRING_VIEW_NONSTD}}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Gandiva User Defined Functions

2019-05-23 Thread Praveen Kumar
Hi Sun,

Starting a thread on the mailing list around the query you had -
https://github.com/apache/arrow/issues/4375.

Currently Gandiva does not support a user-defined function repository. We want
to implement it sometime in the future, but I am not sure when we will pick it
up. If you want to go ahead with it, please share a specification that others
on the list can review and agree upon.

Among the functions you called out, I think the first two should already be
supported. I do not think we support the other two, but it is easier to add
functions than to support a user registry at this point.

Thx.


Re: A couple of questions about pyarrow.parquet

2019-05-23 Thread Uwe L. Korn
Hello Ted,

regarding predicate pushdown in Python, have a look at my unfinished PR at 
https://github.com/apache/arrow/pull/2623. This was stopped since we were 
missing native filter in Arrow. The requirements for that have now been 
implemented and we could probably reactivate the PR.
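
Regarding the first question in the quoted mail, a small pyarrow sketch of the
Parquet-level schema details that are directly accessible today (file name
illustrative); the richer nested view still comes from to_arrow_schema:

import pyarrow.parquet as pq

pf = pq.ParquetFile("nested_map.parquet")

col = pf.schema.column(0)      # per-column Parquet metadata, as strings
print(col.name, col.path, col.physical_type, col.logical_type,
      col.max_definition_level, col.max_repetition_level)

print(pf.schema.to_arrow_schema())   # nested Arrow schema view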

Uwe

On Sat, May 18, 2019, at 3:53 AM, Ted Gooch wrote:
> Thanks Micah and Wes.
> 
> Definitely interested in the *Predicate Pushdown* and *Schema inference,
> schema-on-read, and schema normalization *sections.
> 
> On Fri, May 17, 2019 at 12:47 PM Wes McKinney  wrote:
> 
> > Please see also
> >
> >
> > https://docs.google.com/document/d/1bVhzifD38qDypnSjtf8exvpP3sSB5x_Kw9m-n66FB2c/edit?usp=drivesdk
> >
> > And prior mailing list discussion. I will comment in more detail on the
> > other items later
> >
> > On Fri, May 17, 2019, 2:44 PM Micah Kornfield 
> > wrote:
> >
> > > I can't help on the first question.
> > >
> > > Regarding push-down predicates, there is an open JIRA [1] to do just that
> > >
> > > [1] https://issues.apache.org/jira/browse/PARQUET-473
> > > <
> > >
> > https://issues.apache.org/jira/browse/PARQUET-473?jql=project%20in%20(PARQUET%2C%20ARROW)%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20text%20~%20%22pushdown%22
> > > >
> > >
> > > On Fri, May 17, 2019 at 11:48 AM Ted Gooch  wrote:
> > >
> > > > Hi,
> > > >
> > > > I've been doing some work trying to get the parquet read path going for
> > > the
> > > > python iceberg 
> > library.  I
> > > > have two questions that I couldn't get figured out, and was hoping I
> > > could
> > > > get some guidance from the list here.
> > > >
> > > > First, I'd like to create a ParquetSchema->IcebergSchema converter, but
> > > it
> > > > appears that only limited information is available in the ColumnSchema
> > > > passed back to the python client[2]:
> > > >
> > > > 
> > > >   name: key
> > > >   path: m.map.key
> > > >   max_definition_level: 2
> > > >   max_repetition_level: 1
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > > 
> > > >   name: key
> > > >   path: m.map.value.map.key
> > > >   max_definition_level: 4
> > > >   max_repetition_level: 2
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > > 
> > > >   name: value
> > > >   path: m.map.value.map.value
> > > >   max_definition_level: 5
> > > >   max_repetition_level: 2
> > > >   physical_type: BYTE_ARRAY
> > > >   logical_type: UTF8
> > > >
> > > >
> > > > where physical_type and logical_type are both strings[1].  The arrow
> > > schema
> > > > I can get from *to_arrow_schema *looks to be more expressive(although
> > may
> > > > be I just don't understand the parquet format well enough):
> > > >
> > > > m: struct > list > > > struct not null>>> not null>>
> > > >   child 0, map: list > > list > > > struct not null>>> not null>
> > > >   child 0, map: struct > > > struct not null>>>
> > > >   child 0, key: string
> > > >   child 1, value: struct > > value:
> > > > string> not null>>
> > > >   child 0, map: list > string>
> > > > not null>
> > > >   child 0, map: struct
> > > >   child 0, key: string
> > > >   child 1, value: string
> > > >
> > > >
> > > > It seems like I can infer the info from the name/path, but is there a
> > > more
> > > > direct way of getting the detailed parquet schema information?
> > > >
> > > > Second question, is there a way to push record level filtering into the
> > > > parquet reader, so that the parquet reader only reads in values that
> > > match
> > > > a given predicate expression? Predicate expressions would be simple
> > > > field-to-literal comparisons(>,>=,==,<=,<, !=, is null, is not null)
> > > > connected with logical operators(AND, OR, NOT).
> > > >
> > > > I've seen that after reading-in I can use the filtering language in
> > > > gandiva[3] to get filtered record-batches, but was looking for
> > somewhere
> > > > lower in the stack if possible.
> > > >
> > > >
> > > >
> > > > [1]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L661-L667
> > > > [2] Spark/Hive Table DDL for this parquet file looks like:
> > > > CREATE TABLE `iceberg`.`nested_map` (
> > > > m map>)
> > > > [3]
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/blob/master/python/pyarrow/tests/test_gandiva.py#L86-L100
> > > >
> > >
> >
>


[jira] [Created] (ARROW-5402) [Plasma] Pin objects in plasma store

2019-05-23 Thread Zhijun Fu (JIRA)
Zhijun Fu created ARROW-5402:


 Summary: [Plasma] Pin objects in plasma store
 Key: ARROW-5402
 URL: https://issues.apache.org/jira/browse/ARROW-5402
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Zhijun Fu
Assignee: Zhijun Fu


[https://github.com/apache/arrow/issues/4368]

Sometimes we want to "pin" an object in the plasma store - we don't want this 
object to be deleted even though nobody is currently referencing it. In this 
case, we can specify a flag when creating the object so that it won't be 
deleted by the LRU cache when its refcount drops to 0, and can only be 
deleted by an explicit {{Delete()}} call.
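
A purely hypothetical Python sketch of what such a flag could look like from 
the plasma client (the {{pin}} argument does not exist; its name and placement 
are only illustrative):
{code:java}
import pyarrow.plasma as plasma

client = plasma.connect("/tmp/plasma")
object_id = plasma.ObjectID(20 * b"p")

# Hypothetical `pin=True`: the object survives LRU eviction even when its
# reference count drops to 0.
buf = client.create(object_id, 1024, pin=True)
client.seal(object_id)

# The only way to remove a pinned object would be an explicit delete.
client.delete([object_id])
{code}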

Currently, we are hitting an actor FO (failover) problem. The actor creation 
task depends on a plasma object put by the user. After the actor has been 
running for a long time, the object gets deleted by the plasma LRU cache. Then, 
when an actor FO happens, the creation task cannot find the object put by the 
user, so the FO hangs forever.

Would this make sense to you?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5401) [CI] [C++] Print ccache statistics on Travis-CI

2019-05-23 Thread Antoine Pitrou (JIRA)
Antoine Pitrou created ARROW-5401:
-

 Summary: [CI] [C++] Print ccache statistics on Travis-CI
 Key: ARROW-5401
 URL: https://issues.apache.org/jira/browse/ARROW-5401
 Project: Apache Arrow
  Issue Type: Wish
  Components: C++, Continuous Integration
Reporter: Antoine Pitrou


This would let us know whether compilation caching is really in effect.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5400) [Rust] Test/ensure that reader and writer support zero-length record batches

2019-05-23 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5400:
-

 Summary: [Rust] Test/ensure that reader and writer support 
zero-length record batches
 Key: ARROW-5400
 URL: https://issues.apache.org/jira/browse/ARROW-5400
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5399) [Rust] [Testing] Add IPC test files to arrow-testing

2019-05-23 Thread Neville Dipale (JIRA)
Neville Dipale created ARROW-5399:
-

 Summary: [Rust] [Testing] Add IPC test files to arrow-testing
 Key: ARROW-5399
 URL: https://issues.apache.org/jira/browse/ARROW-5399
 Project: Apache Arrow
  Issue Type: Sub-task
  Components: Rust
Reporter: Neville Dipale


We're generating a lot of files for testing, which should ideally live in 
arrow-testing



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)