Fwd: [Scala/Java] Writing a RecordBatch to a file using Scala

2018-10-06 Thread ALBERTO Bocchinfuso
Hi guys,

I’m trying to write a RecordBatch to a file using Scala. To do so, I use the
Java API.
Aside from the imports, my code is:

imports …

object Test {
  def main(args: Array[String]) {
    val allocator = new RootAllocator(8192)

    val fields = List[Field](
      new Field("names", new FieldType(false, ArrowType.Utf8.INSTANCE, null), null),
      new Field("numbers", new FieldType(false, new ArrowType.Int(64, false), null), null))

    val fieldVectors = List[FieldVector](
      fields(0).getFieldType().createNewSingleVector("names", allocator, null),
      fields(1).getFieldType().createNewSingleVector("numbers", allocator, null))

    fieldVectors(0).initializeChildrenFromFields(fields(0).getChildren())
    fieldVectors(1).initializeChildrenFromFields(fields(1).getChildren())

    val names = List[String]("Name1", "Name2", "Name3", "Name4", "Name5", "Name6", "Name7", "Name8")
    val numbers = List[Int](1, 2, 3, 4, 5, 6, 7, 8)
    val bitmap: ArrowBuf = allocator.buffer(1)
    bitmap.writeByte(255)

    val fos: FileOutputStream = new FileOutputStream("TEST")
    val vsr: VectorSchemaRoot = new VectorSchemaRoot(fields.asJava, fieldVectors.asJava, 8)
    val afw: ArrowFileWriter = new ArrowFileWriter(vsr, null, fos.getChannel())

    val namesBuf: ArrowBuf = allocator.buffer(2048)
    for (i <- names) {
      namesBuf.writeBytes(i.getBytes())
    }

    val numbersBuf: ArrowBuf = allocator.buffer(2048)
    for (i <- numbers) {
      numbersBuf.writeInt(i)
    }

    val btch = new ArrowRecordBatch(8192,
      List(new ArrowFieldNode(8, 0), new ArrowFieldNode(8, 0)).asJava,
      List(bitmap, namesBuf, bitmap, numbersBuf).asJava)

    val loader: VectorLoader = new VectorLoader(vsr)

    afw.start()
    loader.load(btch)
    afw.writeBatch()
    afw.end()
  }
}

After many debugging messages I get:

[error] (run-main-e) java.util.NoSuchElementException: next on empty iterator
[error] java.util.NoSuchElementException: next on empty iterator
[error] at scala.collection.Iterator$$anon$2.next(Iterator.scala:38)
[error] at scala.collection.Iterator$$anon$2.next(Iterator.scala:36)
[error] at scala.collection.LinearSeqLike$$anon$1.next(LinearSeqLike.scala:47)
[error] at scala.collection.convert.Wrappers$IteratorWrapper.next(Wrappers.scala:28)
[error] at org.apache.arrow.vector.VectorLoader.loadBuffers(VectorLoader.java:76)
[error] at org.apache.arrow.vector.VectorLoader.load(VectorLoader.java:61)
[error] at Test$.main(RecordBatchWriteTest.scala:53)
[error] at Test.main(RecordBatchWriteTest.scala)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.lang.reflect.Method.invoke(Method.java:498)
[error] at sbt.Run.invokeMain(Run.scala:93)
[error] at sbt.Run.run0(Run.scala:87)
[error] at sbt.Run.execute$1(Run.scala:65)
[error] at sbt.Run.$anonfun$run$4(Run.scala:77)
[error] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error] at sbt.util.InterfaceUtil$$anon$1.get(InterfaceUtil.scala:10)
[error] at sbt.TrapExit$App.run(TrapExit.scala:252)
[error] at java.lang.Thread.run(Thread.java:745)
[error] java.lang.RuntimeException: Nonzero exit code: 1
[error] at sbt.Run$.executeTrapExit(Run.scala:124)
[error] at sbt.Run.run(Run.scala:77)
[error] at sbt.Defaults$.$anonfun$bgRunTask$5(Defaults.scala:1185)
[error] at sbt.Defaults$.$anonfun$bgRunTask$5$adapted(Defaults.scala:1180)
[error] at sbt.internal.BackgroundThreadPool.$anonfun$run$1(DefaultBackgroundJobService.scala:366)
[error] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error] at scala.util.Try$.apply(Try.scala:209)
[error] at sbt.internal.BackgroundThreadPool$BackgroundRunnable.run(DefaultBackgroundJobService.scala:289)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:745)

[Scala] Plasma client for Java not connecting

2018-08-26 Thread ALBERTO Bocchinfuso
Hi all,

I’ve been trying for a while to connect to a plasma store from Scala.
I’m using the Arrow Java API from Scala, but when I use the classes in
org.apache.arrow.plasma, I cannot reach the host.
I even tried to start the plasma store process before attempting to connect,
from the same program, similarly to:
https://github.com/apache/arrow/blob/master/java/plasma/src/test/java/org/apache/arrow/plasma/PlasmaClientTest.java

Either way I cannot reach the host. The code I wrote follows:

import java.io.IOException
import org.apache.arrow.plasma.PlasmaClient

object TestPlasma {
  def main(args: Array[String]) {
    val builder = new ProcessBuilder("plasma_store", "-m", "268435456", "-s", "test")
    try {
      println(builder.start())
      println("Start process " + " OK")
    } catch {
      case ex: IOException =>
        ex.printStackTrace()
    }

    val plasmaClient = new PlasmaClient("test", "", 1000)
  }
}

From the execution I get:

java.lang.UNIXProcess@1b39dcee
Start process  OK
[error] (run-main-7) java.lang.UnsatisfiedLinkError: org.apache.arrow.plasma.PlasmaClientJNI.connect(Ljava/lang/String;Ljava/lang/String;I)J
[error] java.lang.UnsatisfiedLinkError: org.apache.arrow.plasma.PlasmaClientJNI.connect(Ljava/lang/String;Ljava/lang/String;I)J
[error] at org.apache.arrow.plasma.PlasmaClientJNI.connect(Native Method)
[error] at org.apache.arrow.plasma.PlasmaClient.<init>(PlasmaClient.java:43)
[error] at TestPlasma$.main(plasma_test.scala:15)
[error] at TestPlasma.main(plasma_test.scala)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error] at java.lang.reflect.Method.invoke(Method.java:498)
[error] at sbt.Run.invokeMain(Run.scala:93)
[error] at sbt.Run.run0(Run.scala:87)
[error] at sbt.Run.execute$1(Run.scala:65)
[error] at sbt.Run.$anonfun$run$4(Run.scala:77)
[error] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error] at sbt.util.InterfaceUtil$$anon$1.get(InterfaceUtil.scala:10)
[error] at sbt.TrapExit$App.run(TrapExit.scala:252)
[error] at java.lang.Thread.run(Thread.java:745)
[error] java.lang.RuntimeException: Nonzero exit code: 1
[error] at sbt.Run$.executeTrapExit(Run.scala:124)
[error] at sbt.Run.run(Run.scala:77)
[error] at sbt.Defaults$.$anonfun$bgRunTask$5(Defaults.scala:1185)
[error] at sbt.Defaults$.$anonfun$bgRunTask$5$adapted(Defaults.scala:1180)
[error] at sbt.internal.BackgroundThreadPool.$anonfun$run$1(DefaultBackgroundJobService.scala:366)
[error] at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
[error] at scala.util.Try$.apply(Try.scala:209)
[error] at sbt.internal.BackgroundThreadPool$BackgroundRunnable.run(DefaultBackgroundJobService.scala:289)
[error] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[error] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[error] at java.lang.Thread.run(Thread.java:745)
[error] (Compile / run) Nonzero exit code: 1

If I instead connect to the plasma store started by the code above from a
Python script, I reach it without any problem.

Thank you in advance, and sorry for writing to both dev and user: I had many
problems subscribing to user, and I haven’t understood whether that mailing
list is up and running or not.

Alberto



Help with Java API and RecordBatch creation

2018-08-05 Thread ALBERTO Bocchinfuso

Good morning,

I have to use Apache Arrow with Scala, so I’m using the Java API from Scala,
but I’m confused; I hope someone can clarify a few things for me.

First of all, what is the difference between ArrowRecordBatch (in
org.apache.arrow.vector.ipc.message) and RecordBatch (in
org.apache.arrow.flatbuf)?
In this regard, if a coder wants to use Arrow just for IPC, should she consider
only the classes in the package org.apache.arrow.vector, or should she also
learn how to use the other packages, particularly io.netty.buffer,
org.apache.arrow.memory, and org.apache.arrow.flatbuf?

I don’t understand how to perform in Java everything that is done in Python
in the documentation pages:
 http://arrow.apache.org/docs/python/data.html
 http://arrow.apache.org/docs/python/ipc.html

I’d like to understand how to create what Python calls a RecordBatch and
serialize it to a stream, for example to write it to a file.
I think an ArrowRecordBatch can be created through its constructors, once you
have built a list of ArrowFieldNode (I haven’t understood what this class
stands for, to be honest) and ArrowBuf (I haven’t understood how to create
one; I think I should instantiate an ArrowByteBufAllocator through alloc(),
but then I wouldn’t know how to proceed...), but I’m not sure.
I hope that my doubts are going to be cleared up.

Thank you,
Alberto



Pickle data from python

2018-04-12 Thread ALBERTO Bocchinfuso
Hello,

I cannot pickle RecordBatches, Buffers etc.

I found issue 1654 in the issue tracker, which was resolved by pull request
1238. But that fix appears to apply only to the types listed there (schemas,
DataTypes, etc.).
When I try to pickle Buffers etc. I get exactly the same error reported in the
issue.
Is support for pickling all the data types of pyarrow (with particular
attention to RecordBatches) on the agenda?

Thank you,
Alberto


[Documentation] Incomplete Documentation

2018-03-06 Thread ALBERTO Bocchinfuso
Hi everyone,

I am noticing more and more that the API documentation is missing some
functions and fields. I can testify about the Python API, which is the one I
am using.
For example,
 Batch.num_rows
 Batch.num_columns
 Batch.schema
 pyarrow.concat_tables()

are missing from the API reference, and I do think this can be a problem,
since some developers (I can speak for myself, at least) do not read the
Arrow source code but rely essentially on the documentation.
Even though there are traces of the existence of these functions in various
other sections of the documentation, I think it is important to keep the
reference pages up to date.
I’d like to point this out because I think good docs are really important to
help developers and boost the adoption of software.

Thank you,
Alberto



Re: [Python] Retrieving a RecordBatch from plasma inside a function

2018-02-21 Thread ALBERTO Bocchinfuso
Hi,

Have you had any news on this issue?
Do you plan to solve it for the next releases of Arrow, or is there any way to 
avoid the problem?

Thanks in advance,
Alberto
From: Philipp Moritz<mailto:pcmor...@gmail.com>
Sent: Friday, February 9, 2018 00:30
To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function

Thanks! I can indeed reproduce this problem. I'm a bit busy right now and
plan to look into it on the weekend.

Here is the preliminary backtrace for everybody interested:

* thread #1: tid = 0xf1378e, 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x38158)

    frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28

lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py:

->  0x10e6457fc <+28>: movslq (%rdx,%rcx,4), %rdi
    0x10e645800 <+32>: callq  0x10e698170  ; symbol stub for: PyInt_FromLong
    0x10e645805 <+37>: testq  %rax, %rax
    0x10e645808 <+40>: je     0x10e64580c  ; <+44>

(lldb) bt

* thread #1: tid = 0xf1378e, 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x38158)

  * frame #0: 0x00010e6457fc lib.so`__pyx_pw_7pyarrow_3lib_10Int32Value_1as_py(_object*, _object*) + 28
    frame #1: 0x00010e5ccd35 lib.so`__Pyx_PyObject_CallNoArg(_object*) + 133
    frame #2: 0x00010e613b25 lib.so`__pyx_pw_7pyarrow_3lib_10ArrayValue_3__repr__(_object*) + 933
    frame #3: 0x00010c2f83bc libpython2.7.dylib`PyObject_Repr + 60
    frame #4: 0x00010c35f651 libpython2.7.dylib`PyEval_EvalFrameEx + 22305

On Tue, Feb 6, 2018 at 1:24 AM, ALBERTO Bocchinfuso <
alberto_boc...@hotmail.it> wrote:

> Hi,
>
> I’m using python 3.5.2 and pyarrow 0.8.0
>
> As the key, I put a string of 20 bytes, of course. I’m doing it differently
> from the canonical way since I’m no longer using Python 2.7 but Python 3,
> and this seemed to me the right way to create a string of 20 bytes.
> The full code is:
>
> import pyarrow as pa
> import pyarrow.plasma as plasma
>
> def retrieve1():
>  client = plasma.connect('test', "", 0)
>
>  key = "keynumber1keynumber1"
>  pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>
>  [buff] = client.get_buffers([pid])
>  batch = pa.RecordBatchStreamReader(buff).read_next_batch()
>
>  print(batch)
>  print(batch.schema)
>  print(batch[0])
>
>  return batch
>
> client = plasma.connect('test', "", 0)
>
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
>
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
> sink = pa.MockOutputStream()
> stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
> stream_writer.write_batch(batch)
> stream_writer.close()
>
> bff = client.create(pid, sink.size())
>
> stream = pa.FixedSizeBufferWriter(bff)
> writer = pa.RecordBatchStreamWriter(stream, batch.schema)
> writer.write_batch(batch)
> client.seal(pid)
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> I hope this helps,
> thank you
>
> From: Philipp Moritz<mailto:pcmor...@gmail.com>
> Sent: Tuesday, February 6, 2018 00:00
> To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
> Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a
> function
>
> Hey Alberto,
>
> Thanks for your message! I'm trying to reproduce it.
>
> Can you attach the code you use to write the batch into the store?
>
> Also can you say which version of Python and Arrow you are using? On my
> installation, I get
>
> ```
>
> In [5]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
> ---------------------------------------------------------------------------
>
> ValueError    Traceback (most recent call last)
>
>  in ()
>
> > 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))
>
>
> plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()
>
>
> ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
> ```
>
> (the canonical way to do this would be plasma.ObjectID(b
> "keynumber1keynumber1"))
>
> Best,
> Philipp.
>
> On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
> alberto_boc...@hotmail.it> wrote:
>
> > Good morning,

Re: Merge multiple record batches

2018-02-14 Thread ALBERTO Bocchinfuso
Hi,
I don’t think I understood your point perfectly, but I’ll try to give you the
answer that looks simplest to me.
In your code there isn’t any operation on tables 1 and 2 separately; it just
looks like you want to merge all those RecordBatches.
Now I think that:

  1.  You can use the to_batches() operation reported in the API for Table,
though I never tried it myself. This way you create the 2 tables, create
batches from those tables, and put the batches together.
  2.  I would rather store ALL the BATCHES from the two streams in the SAME
Python LIST, and then create a single table using from_batches() as you
suggested. That’s because in your code you create two tables even though you
don’t seem to care about them.

I didn’t try it, but I think you can go both ways; then tell us whether the
result is the same and whether one of the two is faster than the other.

Alberto

From: Rares Vernica
Sent: Wednesday, February 14, 2018 05:13
To: dev@arrow.apache.org
Subject: Merge multiple record batches

Hi,

If I have multiple RecordBatchStreamReader inputs, what is the recommended
way to get all the RecordBatch from all the inputs together, maybe in a
Table? They all have the same schema. The source for the readers are
different files.

So, I do something like:

reader1 = pa.open_stream('foo')
table1 = reader1.read_all()

reader2 = pa.open_stream('bar')
table2 = reader2.read_all()

# table_all = ???
# OR maybe I don't need to create table1 and table2
# table_all = pa.Table.from_batches( ??? )

Thanks!
Rares



Re: [Python] Retrieving a RecordBatch from plasma inside a function

2018-02-06 Thread ALBERTO Bocchinfuso
Hi,

I’m using python 3.5.2 and pyarrow 0.8.0

As the key, I put a string of 20 bytes, of course. I’m doing it differently
from the canonical way since I’m no longer using Python 2.7 but Python 3, and
this seemed to me the right way to create a string of 20 bytes.
The full code is:

import pyarrow as pa
import pyarrow.plasma as plasma

def retrieve1():
 client = plasma.connect('test', "", 0)

 key = "keynumber1keynumber1"
 pid = plasma.ObjectID(bytearray(key,'UTF-8'))

 [buff] = client.get_buffers([pid])
 batch = pa.RecordBatchStreamReader(buff).read_next_batch()

 print(batch)
 print(batch.schema)
 print(batch[0])

 return batch

client = plasma.connect('test', "", 0)

test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))
sink = pa.MockOutputStream()
stream_writer = pa.RecordBatchStreamWriter(sink, batch.schema)
stream_writer.write_batch(batch)
stream_writer.close()

bff = client.create(pid, sink.size())

stream = pa.FixedSizeBufferWriter(bff)
writer = pa.RecordBatchStreamWriter(stream, batch.schema)
writer.write_batch(batch)
client.seal(pid)

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])

I hope this helps,
thank you

From: Philipp Moritz<mailto:pcmor...@gmail.com>
Sent: Tuesday, February 6, 2018 00:00
To: dev@arrow.apache.org<mailto:dev@arrow.apache.org>
Subject: Re: [Python] Retrieving a RecordBatch from plasma inside a function

Hey Alberto,

Thanks for your message! I'm trying to reproduce it.

Can you attach the code you use to write the batch into the store?

Also can you say which version of Python and Arrow you are using? On my
installation, I get

```

In [*5*]: plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))

---------------------------------------------------------------------------

ValueError                                Traceback (most recent call last)

 in ()

> 1 plasma.ObjectID(bytearray("keynumber1keynumber1", "UTF-8"))


plasma.pyx in pyarrow.plasma.ObjectID.__cinit__()


ValueError: Object ID must by 20 bytes, is keynumber1keynumber1
```

(the canonical way to do this would be plasma.ObjectID(b
"keynumber1keynumber1"))

Best,
Philipp.

On Mon, Feb 5, 2018 at 10:09 AM, ALBERTO Bocchinfuso <
alberto_boc...@hotmail.it> wrote:

> Good morning,
>
> I am experiencing problems with the RecordBatches stored in plasma in a
> particular situation.
>
> If I return a RecordBatch as result of a python function, I am able to
> read just the metadata, while I get an error when reading the columns.
>
> For example, the following code
> def retrieve1():
> client = plasma.connect('test', "", 0)
>
> key = "keynumber1keynumber1"
> pid = plasma.ObjectID(bytearray(key,'UTF-8'))
>
> [buff] = client.get_buffers([pid])
> batch = pa.RecordBatchStreamReader(buff).read_next_batch()
> return batch
>
> batch = retrieve1()
> print(batch)
> print(batch.schema)
> print(batch[0])
>
> Represents a simple python code in which a function is in charge of
> retrieving the RecordBatch from the plasma store, and then returns it to
> the caller. Running the previous example I get:
> 
> FIELD1: int32
> metadata
> 
> {}
> 
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> 
> FIELD1: int32
> metadata
> 
> {}
> Segmentation fault (core dumped)
>
>
> If I retrieve and use the data in the same part of the code (as I do in
> the function retrieve1(); it also works when I put everything in the main
> program), everything runs without problems.
>
> Also the problem seems to be related to the particular case in which I
> retrieve the RecordBatch from the plasma store, since the following
> (simpler) code:
> def create():
> test1 = [1, 12, 23, 3, 21, 34]
> test1 = pa.array(test1, pa.int32())
>
> batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
> print(batch)
> print(batch.schema)
> print(batch[0])
> return batch
>
> batch1 = create()
> print(batch1)
> print(batch1.schema)
> print(batch1[0])
>
> Prints:
>
> 
> FIELD1: int32
> 
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
> 
> FIELD1: int32
> 
> [
>   1,
>   12,
>   23,
>   3,
>   21,
>   34
> ]
>
> Which is what I expect.
>
> Is this issue known or am I doing something wrong when retrieving the
> RecordBatch from plasma?
>
> Also I would like to point out that this problem was as easy to find as it
> was hard to re-create. For this reason, there may be other situations in
> which the same problem arises that I did not experience, since I mostly
> deal with plasma and I’ve been using only Python so far: the description I
> gave is not intended to be complete.
>
> Thank you,
> Alberto
>



[Python] Retrieving a RecordBatch from plasma inside a function

2018-02-05 Thread ALBERTO Bocchinfuso
Good morning,

I am experiencing problems with the RecordBatches stored in plasma in a 
particular situation.

If I return a RecordBatch as result of a python function, I am able to read 
just the metadata, while I get an error when reading the columns.

For example, the following code
def retrieve1():
client = plasma.connect('test', "", 0)

key = "keynumber1keynumber1"
pid = plasma.ObjectID(bytearray(key,'UTF-8'))

[buff] = client.get_buffers([pid])
batch = pa.RecordBatchStreamReader(buff).read_next_batch()
return batch

batch = retrieve1()
print(batch)
print(batch.schema)
print(batch[0])

This is a simple Python program in which a function is in charge of retrieving
the RecordBatch from the plasma store and then returning it to the caller.
Running the previous example I get:

FIELD1: int32
metadata

{}

[
  1,
  12,
  23,
  3,
  21,
  34
]

FIELD1: int32
metadata

{}
Segmentation fault (core dumped)


If I retrieve and use the data in the same part of the code (as I do in the
function retrieve1(); it also works when I put everything in the main
program), everything runs without problems.

Also the problem seems to be related to the particular case in which I retrieve 
the RecordBatch from the plasma store, since the following (simpler) code:
def create():
test1 = [1, 12, 23, 3, 21, 34]
test1 = pa.array(test1, pa.int32())

batch = pa.RecordBatch.from_arrays([test1], ['FIELD1'])
print(batch)
print(batch.schema)
print(batch[0])
return batch

batch1 = create()
print(batch1)
print(batch1.schema)
print(batch1[0])

Prints:


FIELD1: int32

[
  1,
  12,
  23,
  3,
  21,
  34
]

FIELD1: int32

[
  1,
  12,
  23,
  3,
  21,
  34
]

Which is what I expect.

Is this issue known or am I doing something wrong when retrieving the 
RecordBatch from plasma?

Also I would like to point out that this problem was as easy to find as it was
hard to re-create. For this reason, there may be other situations in which the
same problem arises that I did not experience, since I mostly deal with plasma
and I’ve been using only Python so far: the description I gave is not intended
to be complete.

Thank you,
Alberto