Re: Timestamps with different precision / Timedeltas

2016-09-29 Thread Jacques Nadeau
+1

On Thu, Sep 29, 2016 at 3:19 PM, Wes McKinney  wrote:

> hello,
>
> For the current iteration of Arrow, can we agree to support int64 UNIX
> timestamps with a particular resolution (second through nanosecond),
> as these are reasonably common representations? We can look to expand
> later if it is needed.
>
> Thanks
> Wes
>
> On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney  wrote:
> > Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> > purposes of moving data between systems, at minimum) we should propose
> > timestamp metadata and physical memory representation that maximizes
> > interoperability with other systems. It seems like a fixed decimal
> > would meet this requirement as UNIX-like timestamps at some resolution
> > could pass unmodified with appropriate metadata.
> >
> > We will also need decimal types in Arrow (at least to accommodate
> > common database representations and file formats like Parquet), so
> > this seems like a reasonable potential hierarchy of types:
> >
> > Timestamp [logical type]
> > extends FixedDecimal [logical type]
> > extends FixedWidth [physical type]
> >
> > I did a bit of internet searching but did not find a canonical
> > reference or implementation of fixed decimals; that would be helpful.
> >
> > As an aside: for floating decimal numbers for numerical data we could
> > utilize an implementation like http://www.bytereef.org/mpdecimal/
> > which implements the spec described at
> > http://speleotrove.com/decimal/decarith.html
> >
> > Thanks
> > Wes
> >
> > On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel  wrote:
> >> Hi all,
> >>
> >> May I suggest that instead of fixed-point decimals, you consider a more
> >> general fixed-denominator rational representation, for times and other
> >> purposes? Powers of ten are convenient for humans, but powers of two more
> >> efficient. For some applications, the efficiency of bit operations over
> >> divmod is more useful than an exact representation of integral nanoseconds.
> >>
> >> std::chrono takes this approach. I'll also humbly point you at my own
> >> date/time library, https://github.com/alexhsamuel/cron (incomplete but
> >> basically working), which may provide ideas or useful code. It was intended
> >> for precisely this sort of application.
> >>
> >> Regards,
> >> Alex
> >>
> >>
> >> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn  wrote:
> >>
> >>> I agree that having a Decimal type for timestamps is a nice
> >>> definition. Having your time encoded as seconds or nanoseconds should be
> >>> the same as having a scale of the respective amount. But I would rather
> >>> avoid having a separate decimal physical type. Therefore I'd prefer the
> >>> Parquet approach, where decimal is only a logical type backed by
> >>> either a byte array, int32, or int64.
> >>>
> >>> Thus a more general timestamp could look like:
> >>>
> >>> * Decimals are logical types; physical types are the same as defined in
> >>> Parquet [1].
> >>> * Base unit for timestamps is seconds; you can get milliseconds and
> >>> nanoseconds by using a different scale. (Note that these units differ
> >>> from seconds by powers of ten, which matches the specification of decimal
> >>> scale really well.)
> >>> * Timestamp is just another logical type that refers to Decimal
> >>> (and optionally may have a timezone), signalling that we have a time
> >>> and not just a "simple" decimal.
> >>> * For a first iteration, I would assume no timezone or UTC, but not
> >>> include a metadata field. Once we're sure the implementation works, we
> >>> can add metadata about it.
> >>>
> >>> Timedeltas could be addressed in a similar way, just without the need
> >>> for a timezone.
> >>>
> >>> For my use cases, I don't need a timestamp larger than int64
> >>> and would like to have it exactly as such in my computations,
> >>> thus my preference for the Parquet way.
> >>>
> >>> Uwe
> >>>
> >>> [1]
> >>>
> >>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
> >>>
> >>> On 13.07.16 03:06, Julian Hyde wrote:
> >>> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
> >>> > numbers are floating decimal. They have a few nice properties, but
> >>> > they are variable width and can get quite large. I've seen one or two
> >>> > systems that started with binary floating point numbers, which are
> >>> > much worse for business computing, and then changed to Java BigDecimal,
> >>> > 

[jira] [Commented] (ARROW-96) C++: API documentation using Doxygen

2016-09-29 Thread Wes McKinney (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-96?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15534318#comment-15534318
 ] 

Wes McKinney commented on ARROW-96:
---

{{///}} sounds fine to me. Once we get a docs build up, we can start being more
diligent about writing API documentation.

> C++: API documentation using Doxygen 
> -
>
> Key: ARROW-96
> URL: https://issues.apache.org/jira/browse/ARROW-96
> Project: Apache Arrow
>  Issue Type: Task
>  Components: C++
>Reporter: Uwe L. Korn
>Assignee: Uwe L. Korn
>
> For developers using Arrow via C++, we should provide automatically
> generated API documentation via Doxygen.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


Re: Timestamps with different precision / Timedeltas

2016-09-29 Thread Wes McKinney
hello,

For the current iteration of Arrow, can we agree to support int64 UNIX
timestamps with a particular resolution (second through nanosecond),
as these are reasonably common representations? We can look to expand
later if it is needed.

Thanks
Wes
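
To make the proposal above concrete, here is a minimal Java sketch of an
int64 UNIX timestamp that carries its resolution alongside the value. The
class name UnixTimestamp and its methods are hypothetical and are not part
of any Arrow API; the sketch leans on java.util.concurrent.TimeUnit for the
unit conversion and only assumes units from seconds through nanoseconds.

```java
import java.time.Instant;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; this is not an Arrow class.
final class UnixTimestamp {
    private final long value;    // ticks since 1970-01-01T00:00:00Z
    private final TimeUnit unit; // resolution of one tick

    UnixTimestamp(long value, TimeUnit unit) {
        // TimeUnit also defines MINUTES, HOURS and DAYS; the proposal stops at seconds.
        if (unit.compareTo(TimeUnit.SECONDS) > 0) {
            throw new IllegalArgumentException("unit must be SECONDS or finer: " + unit);
        }
        this.value = value;
        this.unit = unit;
    }

    // Re-expresses the value in another unit. Converting to a coarser unit
    // truncates toward zero; converting to a finer one saturates on overflow.
    long to(TimeUnit target) {
        return target.convert(value, unit);
    }

    // Splits the value into whole seconds plus a non-negative nanosecond remainder.
    Instant toInstant() {
        long ticksPerSecond = unit.convert(1, TimeUnit.SECONDS);
        long seconds = Math.floorDiv(value, ticksPerSecond);
        long remainder = Math.floorMod(value, ticksPerSecond);
        return Instant.ofEpochSecond(seconds, remainder * (1_000_000_000L / ticksPerSecond));
    }
}
```

Because the stored value stays a plain long, data at any of the four
resolutions can pass between systems unmodified as long as the unit travels
with it as metadata. One trade-off worth keeping in mind: at nanosecond
resolution an int64 covers roughly 292 years on either side of 1970.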

On Mon, Aug 15, 2016 at 4:12 AM, Wes McKinney  wrote:
> Bumping this discussion. As part of finalizing a v1 Arrow spec (for
> purposes of moving data between systems, at minimum) we should propose
> timestamp metadata and physical memory representation that maximizes
> interoperability with other systems. It seems like a fixed decimal
> would meet this requirement as UNIX-like timestamps at some resolution
> could pass unmodified with appropriate metadata.
>
> We will also need decimal types in Arrow (at least to accommodate
> common database representations and file formats like Parquet), so
> this seems like a reasonable potential hierarchy of types:
>
> Timestamp [logical type]
> extends FixedDecimal [logical type]
> extends FixedWidth [physical type]
>
> I did a bit of internet searching but did not find a canonical
> reference or implementation of fixed decimals; that would be helpful.
>
> As an aside: for floating decimal numbers for numerical data we could
> utilize an implementation like http://www.bytereef.org/mpdecimal/
> which implements the spec described at
> http://speleotrove.com/decimal/decarith.html
>
> Thanks
> Wes
>
> On Thu, Jul 14, 2016 at 8:18 AM, Alex Samuel  wrote:
>> Hi all,
>>
>> May I suggest that instead of fixed-point decimals, you consider a more
>> general fixed-denominator rational representation, for times and other
>> purposes? Powers of ten are convenient for humans, but powers of two more
>> efficient. For some applications, the efficiency of bit operations over
>> divmod is more useful than an exact representation of integral nanoseconds.
>>
>> std::chrono takes this approach. I'll also humbly point you at my own
>> date/time library, https://github.com/alexhsamuel/cron (incomplete but
>> basically working), which may provide ideas or useful code. It was intended
>> for precisely this sort of application.
>>
>> Regards,
>> Alex
>>
>>
>> On Thu, Jul 14, 2016 at 10:27 AM Uwe Korn  wrote:
>>
>>> I agree that having a Decimal type for timestamps is a nice
>>> definition. Having your time encoded as seconds or nanoseconds should be
>>> the same as having a scale of the respective amount. But I would rather
>>> avoid having a separate decimal physical type. Therefore I'd prefer the
>>> Parquet approach, where decimal is only a logical type backed by
>>> either a byte array, int32, or int64.
>>>
>>> Thus a more general timestamp could look like:
>>>
>>> * Decimals are logical types; physical types are the same as defined in
>>> Parquet [1].
>>> * Base unit for timestamps is seconds; you can get milliseconds and
>>> nanoseconds by using a different scale. (Note that these units differ
>>> from seconds by powers of ten, which matches the specification of decimal
>>> scale really well.)
>>> * Timestamp is just another logical type that refers to Decimal
>>> (and optionally may have a timezone), signalling that we have a time
>>> and not just a "simple" decimal.
>>> * For a first iteration, I would assume no timezone or UTC, but not
>>> include a metadata field. Once we're sure the implementation works, we
>>> can add metadata about it.
>>>
>>> Timedeltas could be addressed in a similar way, just without the need
>>> for a timezone.
>>>
>>> For my use cases, I don't need a timestamp larger than int64
>>> and would like to have it exactly as such in my computations,
>>> thus my preference for the Parquet way.
>>>
>>> Uwe
>>>
>>> [1]
>>>
>>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#decimal
>>>
>>> On 13.07.16 03:06, Julian Hyde wrote:
>>> > I'm talking about a fixed decimal type, not floating decimal. (Oracle
>>> > numbers are floating decimal. They have a few nice properties, but
>>> > they are variable width and can get quite large. I've seen one or two
>>> > systems that started with binary floating point numbers, which are
>>> > much worse for business computing, and then changed to Java BigDecimal,
>>> > which gives the right answer but is horribly inefficient.)
>>> >
>>> > A fixed decimal type has virtually zero computational overhead. It
>>> > just has a piece of metadata saying something like "every value in
>>> > this field is multiplied by 1 million" and leaves it to the client
>>> > program to do that multiplying.
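
The fixed-decimal idea quoted above can be sketched in a few lines of Java.
This is only an illustration under stated assumptions: FixedDecimalColumn
and its methods are hypothetical names, not an Arrow or Parquet API, and the
scale is expressed as a count of decimal digits (so "multiplied by 1
million" corresponds to a scale of 6).

```java
import java.math.BigDecimal;

// Hypothetical sketch: raw int64 values plus one piece of shared metadata.
final class FixedDecimalColumn {
    private final long[] unscaled; // stored values, never rescaled in place
    private final int scale;       // decimal digits after the point, shared by all values

    FixedDecimalColumn(long[] unscaled, int scale) {
        this.unscaled = unscaled.clone();
        this.scale = scale;
    }

    int scale() {
        return scale;
    }

    // Arithmetic stays in plain long operations; the scale is not consulted.
    long sumUnscaled() {
        long total = 0;
        for (long v : unscaled) {
            total = Math.addExact(total, v); // throws ArithmeticException on overflow
        }
        return total;
    }

    // The client applies the scale only when it needs the logical value.
    BigDecimal valueAt(int index) {
        return BigDecimal.valueOf(unscaled[index], scale);
    }
}
```

With a scale of 6, the stored values 1500000 and 250000 read back as 1.5 and
0.25, and their unscaled sum 1750000 reads back as 1.75. The same mechanism
covers timestamps in the sense discussed in this thread: an int64 count of
milliseconds is a count of seconds with a scale of 3.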

[jira] [Created] (ARROW-311) [C++] Create a CLI tool that reads an Arrow file and then writes it back out

2016-09-29 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-311:
--

 Summary: [C++] Create a CLI tool that reads an Arrow file and then 
writes it back out
 Key: ARROW-311
 URL: https://issues.apache.org/jira/browse/ARROW-311
 Project: Apache Arrow
  Issue Type: New Feature
  Components: C++
Reporter: Wes McKinney
Assignee: Wes McKinney


This will assist with integration testing





[jira] [Created] (ARROW-310) [Java] Create CLI tool to read an Arrow file and write it out (for integration testing)

2016-09-29 Thread Wes McKinney (JIRA)
Wes McKinney created ARROW-310:
--

 Summary: [Java] Create CLI tool to read an Arrow file and write it 
out (for integration testing)
 Key: ARROW-310
 URL: https://issues.apache.org/jira/browse/ARROW-310
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java - Vectors
Reporter: Wes McKinney


There will need to be an analogous tool in C++





[jira] [Comment Edited] (ARROW-308) UnionListWriter.setPosition() should not call startList()

2016-09-29 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533196#comment-15533196
 ] 

Deneche A. Hakim edited comment on ARROW-308 at 9/29/16 9:36 PM:
-

--The fix I merged is incomplete: many writers work under the assumption
that UnionListWriter.setPosition() also calls startList(). I am working on a
proper fix, but if it takes too long I will probably just revert the merged PR
until I get it done.--


was (Author: adeneche):
The fix I merged is incomplete: many writers work under the assumption
that UnionListWriter.setPosition() also calls startList(). I am working on a
proper fix, but if it takes too long I will probably just revert the merged PR
until I get it done.

> UnionListWriter.setPosition() should not call startList()
> -
>
> Key: ARROW-308
> URL: https://issues.apache.org/jira/browse/ARROW-308
> Project: Apache Arrow
>  Issue Type: Bug
>  Components: Java - Vectors
>Reporter: Deneche A. Hakim
>Assignee: Deneche A. Hakim
>
> UnionListWriter.setPosition() is implemented as follows:
> {code}
> @Override
> public void setPosition(int index) {
>   super.setPosition(index);
>   startList();
> }
> {code}
> It works fine, but if you run the following code:
> {code}
> MapVector parent = new MapVector("parent", allocator, null);
> ComplexWriter writer = new ComplexWriterImpl("root", parent);
> MapWriter rootWriter = writer.rootAsMap();
> rootWriter.start();
> rootWriter.bigInt("int").writeBigInt(0);
> rootWriter.list("list").startList();
> rootWriter.list("list").bigInt().writeBigInt(0);
> rootWriter.list("list").endList();
> rootWriter.end();
> rootWriter.setPosition(1);
> rootWriter.start();
> rootWriter.bigInt("int").writeBigInt(1);
> rootWriter.end();
> rootWriter.setPosition(2);
> rootWriter.bigInt("int").writeBigInt(2);
> rootWriter.start();
> rootWriter.list("list").startList();
> rootWriter.list("list").bigInt().writeBigInt(2);
> rootWriter.list("list").endList();
> rootWriter.end();
> writer.setValueCount(3);
> for (int i = 0; i < 3; i++) {
>   parent.getReader().setPosition(i);
>   System.out.printf("%d: %s%n", i, parent.getReader().readObject());
> }
> {code}
> You get:
> {noformat}
> 0: {"root":{"int":0,"list":[0]}}
> 1: {"root":{"int":1,"list":[]}}
> 2: {"root":{"int":2,"list":[2]}}
> {noformat}
> Even though we didn't write anything to "list" in the 2nd row, it shows up as
> empty instead of null. I tracked the problem to UnionListWriter.setPosition()
> calling startList(), which marks the row as not null even if we don't write
> anything to it.
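
The mechanism described in the issue can be reproduced with a toy model that
does not depend on Arrow at all. ToyListColumn below is a hypothetical
stand-in, not the real UnionListWriter: it keeps one "is set" flag per row,
and a constructor switch chooses whether setPosition() also calls
startList(), which is the behavior the issue objects to.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only; the real writer classes are considerably more involved.
final class ToyListColumn {
    private final List<List<Long>> rows = new ArrayList<>();
    private final List<Boolean> isSet = new ArrayList<>();
    private final boolean startListOnSetPosition;
    private int position = -1;

    ToyListColumn(boolean startListOnSetPosition) {
        this.startListOnSetPosition = startListOnSetPosition;
    }

    void setPosition(int index) {
        // Grow the column so that the requested row exists, initially null.
        while (rows.size() <= index) {
            rows.add(new ArrayList<>());
            isSet.add(false);
        }
        position = index;
        if (startListOnSetPosition) {
            startList(); // marks the row as not null even if nothing is written
        }
    }

    void startList() {
        isSet.set(position, true);
    }

    void write(long value) {
        rows.get(position).add(value);
    }

    // Returns null for rows that were never started, otherwise the row's contents.
    List<Long> read(int index) {
        return isSet.get(index) ? rows.get(index) : null;
    }
}
```

With the switch on, merely visiting row 1 via setPosition(1) makes read(1)
return an empty list. With it off, the caller has to call startList()
explicitly before writing, and a row that was only visited reads back as
null, which is the outcome the issue asks for.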





[jira] [Created] (ARROW-309) Types.getMinorTypeForArrowType() does not work for Union type

2016-09-29 Thread Julien Le Dem (JIRA)
Julien Le Dem created ARROW-309:
---

 Summary: Types.getMinorTypeForArrowType() does not work for Union 
type
 Key: ARROW-309
 URL: https://issues.apache.org/jira/browse/ARROW-309
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java - Vectors
Reporter: Julien Le Dem
Assignee: Julien Le Dem








[jira] [Commented] (ARROW-308) UnionListWriter.setPosition() should not call startList()

2016-09-29 Thread Deneche A. Hakim (JIRA)

[ 
https://issues.apache.org/jira/browse/ARROW-308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15533196#comment-15533196
 ] 

Deneche A. Hakim commented on ARROW-308:


The fix I merged is incomplete: many writers work under the assumption
that UnionListWriter.setPosition() also calls startList(). I am working on a
proper fix, but if it takes too long I will probably just revert the merged PR
until I get it done.