[jira] [Created] (ARROW-3863) [GLib] Use travis_retry with brew bundle command

2018-11-23 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-3863:
---

 Summary: [GLib] Use travis_retry with brew bundle command
 Key: ARROW-3863
 URL: https://issues.apache.org/jira/browse/ARROW-3863
 Project: Apache Arrow
  Issue Type: Improvement
  Components: GLib
Reporter: Wes McKinney
 Fix For: 0.12.0


This has been flaky lately, see

https://travis-ci.org/apache/arrow/jobs/458878912#L1844

It may not make the errors go away, but it might be worth adding retry logic to try a few times before giving up.





[jira] [Created] (ARROW-3862) Improve dependencies download script

2018-11-23 Thread Francois Saint-Jacques (JIRA)
Francois Saint-Jacques created ARROW-3862:
-

 Summary: Improve dependencies download script 
 Key: ARROW-3862
 URL: https://issues.apache.org/jira/browse/ARROW-3862
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Developer Tools
Reporter: Francois Saint-Jacques








[jira] [Created] (ARROW-3861) ParquetDataset().read columns argument always returns partition column

2018-11-23 Thread Christian Thiel (JIRA)
Christian Thiel created ARROW-3861:
--

 Summary: ParquetDataset().read columns argument always returns 
partition column
 Key: ARROW-3861
 URL: https://issues.apache.org/jira/browse/ARROW-3861
 Project: Apache Arrow
  Issue Type: Bug
Reporter: Christian Thiel


I just noticed that no matter which columns are specified on load of a dataset, 
the partition column is always returned. This might lead to strange behaviour, 
as the resulting dataframe has more than the expected columns:
{code}
import dask as da
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name = 'DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

table = pa.Table.from_pandas(df, schema=my_schema)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'])

df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(
    columns=['DPRD_ID', 'strings']).to_pandas()
# pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'],
#                 engine='pyarrow')
df_pq
{code}
df_pq still has the `partition_column` column, even though only `DPRD_ID` and `strings` were requested.





Re: [Go] High memory usage on CSV read into table

2018-11-23 Thread Sebastien Binet
On Mon, Nov 19, 2018 at 11:29 PM Wes McKinney  wrote:

> That seems buggy then. There is only 4.125 bytes of overhead per
> string value on average (a 32-bit offset, plus a valid bit)
> On Mon, Nov 19, 2018 at 5:02 PM Daniel Harper 
> wrote:
> >
> > Uncompressed
> >
> > $ ls -la concurrent_streams.csv
> > -rw-r--r-- 1 danielharper 112M Nov 16 19:21 concurrent_streams.csv
> >
> > $ wc -l concurrent_streams.csv
> >  1007481 concurrent_streams.csv
> >
> >
> > Daniel Harper
> > http://djhworld.github.io
> >
> >
> > On Mon, 19 Nov 2018 at 21:55, Wes McKinney  wrote:
> >
> > > I'm curious how the file is only 100MB if it's producing ~6GB of
> > > strings in memory. Is it compressed?
> > > On Mon, Nov 19, 2018 at 4:48 PM Daniel Harper 
> > > wrote:
> > > >
> > > > Thanks,
> > > >
> > > > I've tried the new code and that seems to have shaved about 1GB of
> memory
> > > > off, so the heap is about 8.84GB now, here is the updated pprof
> output
> > > > https://i.imgur.com/itOHqBf.png
> > > >
> > > > It looks like the majority of allocations are in the
> memory.GoAllocator
> > > >
> > > > (pprof) top
> > > > Showing nodes accounting for 8.84GB, 100% of 8.84GB total
> > > > Showing top 10 nodes out of 41
> > > >   flat  flat%   sum%cum   cum%
> > > > 4.24GB 47.91% 47.91% 4.24GB 47.91%
> > > > github.com/apache/arrow/go/arrow/memory.(*GoAllocator).Allocate
> > > > 2.12GB 23.97% 71.88% 2.12GB 23.97%
> > > > github.com/apache/arrow/go/arrow/memory.NewResizableBuffer (inline)
> > > > 1.07GB 12.07% 83.95% 1.07GB 12.07%
> > > > github.com/apache/arrow/go/arrow/array.NewData
> > > > 0.83GB  9.38% 93.33% 0.83GB  9.38%
> > > > github.com/apache/arrow/go/arrow/array.NewStringData
> > > > 0.33GB  3.69% 97.02% 1.31GB 14.79%
> > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).newData
> > > > 0.18GB  2.04% 99.06% 0.18GB  2.04%
> > > > github.com/apache/arrow/go/arrow/array.NewChunked
> > > > 0.07GB  0.78% 99.85% 0.07GB  0.78%
> > > > github.com/apache/arrow/go/arrow/array.NewInt64Data
> > > > 0.01GB  0.15%   100% 0.21GB  2.37%
> > > > github.com/apache/arrow/go/arrow/array.(*Int64Builder).newData
> > > >  0 0%   100%6GB 67.91%
> > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Append
> > > >  0 0%   100% 4.03GB 45.54%
> > > > github.com/apache/arrow/go/arrow/array.(*BinaryBuilder).Reserve
> > > >
> > > >
> > > > I'm a bit busy at the moment but I'll probably repeat the same test
> on
> > > the
> > > > other Arrow implementations (e.g. Java) to see if they allocate a
> similar
> > > > amount.
>

I've implemented chunking over there:

- https://github.com/apache/arrow/pull/3019

could you try with a couple of chunking values?
e.g.:
- csv.WithChunk(-1): reads the whole file into memory, creates one big record
- csv.WithChunk(nrows/10): creates 10 records
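
A rough, illustrative sketch (not from the original thread) of what reading with an explicit chunk size might look like, assuming the csv.WithChunk option from the PR above; the file name, schema, and chunk size are placeholders rather than details of the actual data:

{code}
// Illustrative sketch only: read a CSV file with the Go reader using an
// explicit chunk size. Schema and chunk size are assumptions, not taken
// from the original report.
package main

import (
	"os"

	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/csv"
)

func main() {
	f, err := os.Open("concurrent_streams.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	schema := arrow.NewSchema([]arrow.Field{
		{Name: "stream", Type: arrow.BinaryTypes.String},
		{Name: "count", Type: arrow.PrimitiveTypes.Int64},
	}, nil)

	// csv.WithChunk(-1) would produce a single record for the whole file;
	// a positive value groups that many rows per record (~nrows/10 here).
	r := csv.NewReader(f, schema, csv.WithChunk(100748))
	defer r.Release()

	for r.Next() {
		rec := r.Record() // valid until the next call to Next(); Retain() to keep it
		_ = rec
	}
	if err := r.Err(); err != nil {
		panic(err)
	}
}
{code}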

also, it would be great to try to disentangle the memory usage of the "CSV reading part" from the "Table creation" one:
- have some perf numbers w/o storing all these Records into a []Record slice,
- have some perf numbers w/ only storing these Records into a []Record slice,
- have some perf numbers w/ storing the records into the slice + creating the Table.
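
A similarly rough sketch of the last two measurements, assuming the reader from the snippet above and assuming array.NewTableFromRecords as the Go Table constructor (that constructor name is an assumption, not confirmed by the thread):

{code}
// Illustrative sketch only: keep the records in a []array.Record slice
// (second measurement), then build a Table from them (third measurement).
package csvbench

import (
	"github.com/apache/arrow/go/arrow"
	"github.com/apache/arrow/go/arrow/array"
	"github.com/apache/arrow/go/arrow/csv"
)

// collect drains the CSV reader, retaining each record so it stays valid
// after the reader moves on to the next chunk.
func collect(r *csv.Reader) []array.Record {
	var recs []array.Record
	for r.Next() {
		rec := r.Record()
		rec.Retain() // the reader releases its current record on the next Next()
		recs = append(recs, rec)
	}
	return recs
}

// toTable assembles the retained records into a Table.
func toTable(schema *arrow.Schema, recs []array.Record) array.Table {
	return array.NewTableFromRecords(schema, recs)
}
{code}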

hth,
-s