Hi Wes,

Yes, sorry for the mess. Here is the message in plain text:

The libgdf project defines a column structure that in a simplified form
could be represented as

typedef struct {
    void *data;                     // column data
    unsigned char *valid;           // validity mask, one bit per column item
    size_t size;                    // number of items
    enum {INT8, INT16, ...} dtype;  // type of column item
    size_t null_count;              // number of non-valid items
} my_column_t;
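
To make the validity mask convention concrete, here is a minimal sketch of
how the mask could be read and how null_count relates to it. It assumes
that bit i of the mask is stored in byte i/8 at bit position i%8 (least
significant bit first) and that a set bit means "valid". The helper names
(mask_nbytes, is_valid, count_nulls) are hypothetical and used only for
illustration, and the enum lists just the two types named above.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified column from the message above. */
typedef struct {
    void *data;                                        /* column data */
    unsigned char *valid;                              /* validity mask, one bit per item */
    size_t size;                                       /* number of items */
    enum { INT8, INT16 /* further types elided */ } dtype;
    size_t null_count;                                 /* number of non-valid items */
} my_column_t;

/* Bytes needed for a mask covering n items (one bit each, rounded up). */
static size_t mask_nbytes(size_t n) { return (n + 7) / 8; }

/* Assumption: LSB-first bit order, set bit means the item is valid. */
static int is_valid(const my_column_t *col, size_t i) {
    assert(i < col->size);
    return (col->valid[i / 8] >> (i % 8)) & 1;
}

/* Recompute the number of non-valid items by scanning the mask. */
static size_t count_nulls(const my_column_t *col) {
    size_t nulls = 0;
    for (size_t i = 0; i < col->size; ++i)
        nulls += !is_valid(col, i);
    return nulls;
}
```

Under these assumptions, count_nulls(&col) should always agree with
col.null_count, which is a cheap consistency check after a transfer.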

The aim is to implement an IPC protocol for sharing my_column_t data between
host and GPU devices.

What would be the most sensible way to do that using the tools available in
the Arrow library?

We are currently considering the following approaches:

1. Re-using Arrow Array: my_column_t and Arrow Array have a one-to-one
correspondence regarding data content.

2. Defining a new Arrow format, MyColumn (using Arrow Tensor as an example):

table MyColumn {
  /// The type of data contained in a value cell.
  type: Type;
  /// The number of non-valid items
  null_count: long;
  /// The location and size of the column's data
  data: Buffer;
  /// The location and size of the column's mask
  valid: Buffer;
}
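
For either approach, the receiver needs to know where the data and mask
buffers sit inside the shared memory region. Here is a rough sketch of that
bookkeeping, assuming each Buffer entry is an offset/length pair relative to
the start of a single contiguous message body and that buffers are padded
to 8 bytes. The padding choice, the struct names (buffer_desc_t,
my_column_meta_t) and the function layout_column are assumptions made for
illustration, not part of Arrow or libgdf, and the type field is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mirror of a Buffer entry: start and size within the body. */
typedef struct {
    int64_t offset;
    int64_t length;
} buffer_desc_t;

/* Hypothetical metadata describing one MyColumn message. */
typedef struct {
    int64_t null_count;    /* number of non-valid items */
    buffer_desc_t data;    /* location and size of the column's data */
    buffer_desc_t valid;   /* location and size of the column's mask */
    int64_t body_length;   /* total body size, including padding */
} my_column_meta_t;

/* Round n up to a multiple of 8 (the alignment is an assumption). */
static int64_t pad8(int64_t n) { return (n + 7) & ~(int64_t)7; }

/* Place the data buffer first, then the mask, each padded to 8 bytes. */
static my_column_meta_t layout_column(int64_t size, int64_t item_nbytes,
                                      int64_t null_count) {
    assert(size >= 0 && item_nbytes > 0);
    my_column_meta_t m;
    m.null_count = null_count;
    m.data.offset = 0;
    m.data.length = size * item_nbytes;
    m.valid.offset = pad8(m.data.length);
    m.valid.length = (size + 7) / 8;   /* one bit per item */
    m.body_length = m.valid.offset + pad8(m.valid.length);
    return m;
}
```

With such metadata, the receiving side could set data and valid to
base + offset inside the shared body without copying, which is the
zero-copy property we are after.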

We are uncertain which approach would be easiest to implement and maintain,
which would be efficient (zero-copy), and whether either makes sense at all.

Defining an Arrow MyColumn seems appealing because Arrow Tensor involves
about 7 times less code than Arrow Array. However, Arrow Array already
includes a validity mask.

What do you think?

Best regards,
Pearu


On Wed, Aug 22, 2018 at 11:53 PM, Wes McKinney <wesmck...@gmail.com> wrote:

> Hi Pearu,
>
> Seems the formatting of your email got messed up a little bit. Can you
> resend with some more line breaks?
>
> Thanks
>
>
> On Wed, Aug 22, 2018, 4:46 PM Pearu Peterson <pearu.peter...@quansight.com>
> wrote:
>
> > *Hi,The libgdf project defines a column structure that in a simplified
> form
> > could be represented astypedef struct {    void *data;
> //
> > column data    unsigned char *valid; // validity mask // one bit per
> column
> > item    size_t size;                 // nof items    enum {INT8, INT16,
> > ...} dtype; // type of column item    size_t null_count;           // nof
> > non-valid items} my_column_t;The aim is to implement IPC protocol for
> > sharing my_column_t data between host and GPU devices. What would be the
> > most sensible way to do that using tools available in Arrow library?We
> are
> > currently considering the following approaches:1. Re-using Arrow Array
> > (C++): my_column_t and Arrow Array have one-to-one correspondence
> regarding
> > data content.2. Defining new Arrow format MyColumn (using Arrow Tensor as
> > an example):table MyColumn {  /// The type of data contained in a value
> > cell.  type: Type;  /// The number of non-valid items  null_count: long;
> >  /// The location and size of the column's data  data: Buffer;  /// The
> > location and size of the column's mask  valid: Buffer;}We are uncertain
> > which approach would be easiest to implement and maintain, be efficient
> > (0-copy), or would make sense at all.Defining Arrow MyColumn seems
> > appealing because of about 7 times less code in Arrow Tensor than in
> Arrow
> > Array. However, Arrow Array includes validity mask already.What do you
> > think?Best regards,Pearu*
> >
>
