[jira] [Created] (ARROW-9010) [Java] Framework and interface changes for RecordBatch IPC buffer compression

2020-06-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-9010:
---

 Summary: [Java] Framework and interface changes for RecordBatch 
IPC buffer compression
 Key: ARROW-9010
 URL: https://issues.apache.org/jira/browse/ARROW-9010
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is the first sub-work item of ARROW-8672 (
[Java] Implement RecordBatch IPC buffer compression from ARROW-300). However, 
it does not involve any concrete compression algorithms. The purpose of this PR 
is to establish basic interfaces for data compression, and to change the 
IPC framework so that different compression algorithms can be plugged in smoothly. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-8973) [Java] Support batch value appending for large varchar/varbinary vectors

2020-05-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-8973:
---

 Summary: [Java] Support batch value appending for large 
varchar/varbinary vectors
 Key: ARROW-8973
 URL: https://issues.apache.org/jira/browse/ARROW-8973
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan


Support appending values in batch for LargeVarCharVector/LargeVarBinaryVector.





[jira] [Created] (ARROW-8972) [Java] Support range value comparison for large varchar/varbinary vectors

2020-05-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-8972:
---

 Summary: [Java] Support range value comparison for large 
varchar/varbinary vectors
 Key: ARROW-8972
 URL: https://issues.apache.org/jira/browse/ARROW-8972
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Support comparing a range of values for LargeVarCharVector and 
LargeVarBinaryVector.





[jira] [Created] (ARROW-8940) [Java] Fix the performance degradation of integration tests

2020-05-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-8940:
---

 Summary: [Java] Fix the performance degradation of integration 
tests
 Key: ARROW-8940
 URL: https://issues.apache.org/jira/browse/ARROW-8940
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In the past, we ran integration tests from main methods. Recently, we 
changed this to run them through the failsafe plugin. 

This is a good change, but it also leads to significant performance 
degradation. In the past, it took about 10s to run 
{{ITTestLargeVector#testLargeDecimalVector}}, now it takes more than half an 
hour. 

Our investigation shows that the problem was caused by calling 
{{HistoricalLog#recordEvent}} repeatedly. This method is called only when 
{{BaseAllocator#DEBUG}} is enabled. In a unit/integration test, the flag is 
enabled by default. 

We solve the problem with the following steps:
1. We set a system property to disable the {{BaseAllocator#DEBUG}} flag.
2. We change the logic so that the system property takes precedence over the 
{{AssertionUtil#isAssertionsEnabled}} method. 

This makes the integration tests as fast as before. 
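
The precedence rule in step 2 can be sketched as follows. This is a minimal 
illustration, not the actual BaseAllocator code: the class name, the method 
names and the property name "example.memory.debug" are all hypothetical.

```java
/** Hypothetical sketch of "system property overrides the assertion status". */
final class DebugFlagSketch {
  // Hypothetical property name, chosen only for this example.
  static final String PROPERTY = "example.memory.debug";

  /** Returns true only when the JVM runs with assertions enabled (-ea). */
  static boolean assertionsEnabled() {
    boolean enabled = false;
    // The assignment is evaluated only if assertions are enabled.
    assert enabled = true;
    return enabled;
  }

  /** An explicit property value wins; otherwise fall back to the assertion status. */
  static boolean resolveDebug(String propertyValue, boolean assertionsEnabled) {
    if (propertyValue != null) {
      return Boolean.parseBoolean(propertyValue);
    }
    return assertionsEnabled;
  }

  static boolean isDebug() {
    return resolveDebug(System.getProperty(PROPERTY), assertionsEnabled());
  }
}
```

With this ordering, a test run that enables assertions can still switch the 
debug path off by setting the property to false.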





[jira] [Created] (ARROW-8771) [C++] Add boost/process library to build support

2020-05-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-8771:
---

 Summary: [C++] Add boost/process library to build support
 Key: ARROW-8771
 URL: https://issues.apache.org/jira/browse/ARROW-8771
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Liya Fan


Some of our test source code requires the process.hpp file (and its dependent 
libraries). Our current build support does not include these files, causing 
build failures like:

fatal error: boost/process.hpp: No such file or directory





[jira] [Created] (ARROW-8761) [C++] Improve the performance of minmax kernel

2020-05-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-8761:
---

 Summary: [C++] Improve the performance of minmax kernel
 Key: ARROW-8761
 URL: https://issues.apache.org/jira/browse/ARROW-8761
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


We improve the performance of the minmax kernel with a simple idea: if the 
current value is smaller than the current min value, then there is no need to 
compare it against the current max value, because it must also be smaller than 
the current max value. 

This simple trick reduces the expected number of comparisons from 2n to 1.5n, 
which can be notable for large arrays. 
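
The idea can be sketched in Java as below (the kernel itself is C++). This is 
only an illustration under assumed names: MinMaxSketch and its comparison 
counter are hypothetical and exist so the number of comparisons can be 
observed for a given input.

```java
/** Hypothetical sketch of the "skip the max check after a new min" idea. */
final class MinMaxSketch {
  /** Element comparisons made by the last call (illustration only, not thread-safe). */
  static long comparisons;

  /** Returns {min, max} of a non-empty array. */
  static int[] minMax(int[] values) {
    if (values.length == 0) {
      throw new IllegalArgumentException("empty input");
    }
    comparisons = 0;
    int min = values[0];
    int max = values[0];
    for (int i = 1; i < values.length; i++) {
      int v = values[i];
      comparisons++;
      if (v < min) {
        // v < min <= max, so v cannot be a new max: skip the second comparison.
        min = v;
      } else {
        comparisons++;
        if (v > max) {
          max = v;
        }
      }
    }
    return new int[] {min, max};
  }
}
```

The saving depends on how often a value is a new minimum, so the counter is 
the simplest way to check the effect on a particular data set.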





[jira] [Created] (ARROW-8481) [Java] Provide an allocation manager based on Unsafe API

2020-04-16 Thread Liya Fan (Jira)
Liya Fan created ARROW-8481:
---

 Summary: [Java] Provide an allocation manager based on Unsafe API
 Key: ARROW-8481
 URL: https://issues.apache.org/jira/browse/ARROW-8481
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is in response to the discussion in 
https://github.com/apache/arrow/pull/6323#issuecomment-614195070

In this issue, we provide an allocation manager that is capable of allocating 
large (> 2GB) buffers. In addition, it does not depend on the netty library, 
which aligns with the general trend of removing netty dependencies. In the 
future, we are going to make it the default allocation manager. 





[jira] [Created] (ARROW-8468) [Document] Fix the incorrect null bits description

2020-04-15 Thread Liya Fan (Jira)
Liya Fan created ARROW-8468:
---

 Summary: [Document] Fix the incorrect null bits description
 Key: ARROW-8468
 URL: https://issues.apache.org/jira/browse/ARROW-8468
 Project: Apache Arrow
  Issue Type: Bug
  Components: Documentation
Reporter: Liya Fan
Assignee: Liya Fan


The description of the null bits in arrays.rst is incorrect.





[jira] [Created] (ARROW-8402) [Java] Support ValidateFull methods in Java

2020-04-11 Thread Liya Fan (Jira)
Liya Fan created ARROW-8402:
---

 Summary: [Java] Support ValidateFull methods in Java
 Key: ARROW-8402
 URL: https://issues.apache.org/jira/browse/ARROW-8402
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need to support ValidateFull methods in Java, just like we do in C++. 
This is required by ARROW-5926.





[jira] [Created] (ARROW-8392) [Java] Fix overflow related corner cases for vector value comparison

2020-04-09 Thread Liya Fan (Jira)
Liya Fan created ARROW-8392:
---

 Summary: [Java] Fix overflow related corner cases for vector value 
comparison
 Key: ARROW-8392
 URL: https://issues.apache.org/jira/browse/ARROW-8392
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan


1. Fix corner cases related to overflow.
2. Provide test cases for the corner cases. 





[jira] [Created] (ARROW-8230) [Java] Move Netty memory manager into a separate module

2020-03-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-8230:
---

 Summary: [Java] Move Netty memory manager into a separate module
 Key: ARROW-8230
 URL: https://issues.apache.org/jira/browse/ARROW-8230
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan


Move Netty memory manager into a separate module such that the basic allocator 
does not depend on it.





[jira] [Created] (ARROW-8229) [Java] Move ArrowBuf into the Arrow package

2020-03-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-8229:
---

 Summary: [Java] Move ArrowBuf into the Arrow package
 Key: ARROW-8229
 URL: https://issues.apache.org/jira/browse/ARROW-8229
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


After ARROW-7505 and ARROW-7935 are done, we are ready to move ArrowBuf into 
Arrow's package, and make it independent of Netty library.





[jira] [Created] (ARROW-8169) [Java] Improve the performance of JDBC adapter by allocating memory proactively

2020-03-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-8169:
---

 Summary: [Java] Improve the performance of JDBC adapter by 
allocating memory proactively
 Key: ARROW-8169
 URL: https://issues.apache.org/jira/browse/ARROW-8169
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation uses {{setSafe}} methods to dynamically allocate 
memory if necessary. For fixed width vectors (which are frequently used in 
JDBC), however, we can allocate memory proactively, since the vector size is 
known as a configuration parameter. So for fixed width vectors, we can use 
{{set}} methods instead.

This change leads to two benefits:
1. When processing each value, we no longer have to check vector capacity and 
reallocate memory if needed. This leads to better performance.
2. If we allow the memory to expand automatically (each time by 2x), the amount 
of memory usually ends up being more than necessary. By allocating memory 
according to the configuration parameter, we allocate no more and no less than 
needed. 
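
Both benefits can be seen in a small stand-alone model. The class below is a 
hypothetical stand-in for a fixed width vector (it is not an Arrow class): 
setSafe checks capacity and doubles it when needed, while set assumes the 
caller allocated enough space up front.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a fixed width vector; not part of Arrow. */
final class IntColumnSketch {
  private int[] data;
  int reallocations = 0;

  IntColumnSketch(int initialCapacity) {
    data = new int[initialCapacity];
  }

  int capacity() {
    return data.length;
  }

  int get(int index) {
    return data[index];
  }

  /** No capacity check: the caller must have allocated enough space. */
  void set(int index, int value) {
    data[index] = value;
  }

  /** Checks capacity on every call and doubles the storage when needed. */
  void setSafe(int index, int value) {
    while (index >= data.length) {
      data = Arrays.copyOf(data, Math.max(1, data.length * 2));
      reallocations++;
    }
    data[index] = value;
  }
}
```

Writing 5000 values through setSafe into a column that starts at 1024 slots 
ends with 8192 slots after three reallocations, whereas a column created with 
5000 slots and filled through set keeps exactly 5000.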

Benchmark results show notable performance improvements:

Before:

Benchmark                               Mode  Cnt    Score   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt    5  521.700 ± 4.837  us/op

After:

Benchmark                               Mode  Cnt    Score   Error  Units
JdbcAdapterBenchmarks.consumeBenchmark  avgt    5  430.523 ± 9.932  us/op





[jira] [Created] (ARROW-8121) [Java] Enhance code style checking for Java code (add space after commas, semi-colons and type casts)

2020-03-14 Thread Liya Fan (Jira)
Liya Fan created ARROW-8121:
---

 Summary: [Java] Enhance code style checking for Java code (add 
space after commas, semi-colons and type casts)
 Key: ARROW-8121
 URL: https://issues.apache.org/jira/browse/ARROW-8121
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is in response to a discussion in 
https://github.com/apache/arrow/pull/6039#discussion_r375161992

We found the current style checking for Java code is not sufficient. So we want 
to enhance it in a series of "small" steps, in order to avoid having to change 
too many files at once.

In this issue, we add spaces after commas, semi-colons and type casts.





[jira] [Created] (ARROW-8108) [Java] Extract a common interface for dictionary encoders

2020-03-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-8108:
---

 Summary: [Java] Extract a common interface for dictionary encoders
 Key: ARROW-8108
 URL: https://issues.apache.org/jira/browse/ARROW-8108
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In this issue, we extract a common interface from the existing dictionary 
encoders. This can be useful for scenarios where the client does not care about 
the encoder implementations. 





[jira] [Created] (ARROW-8009) [Java] Fix the hash code methods for BitVector

2020-03-05 Thread Liya Fan (Jira)
Liya Fan created ARROW-8009:
---

 Summary: [Java] Fix the hash code methods for BitVector
 Key: ARROW-8009
 URL: https://issues.apache.org/jira/browse/ARROW-8009
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current hash code methods of BitVector are based on implementations in 
BaseFixedWidthVector, which rely on the type width of the vector. 
For BitVector, the type width is 0, so the underlying data is not actually used 
when computing the hash code. That means the hash code will always be 0, 
regardless of whether the value is null, and regardless of whether the 
underlying bit is 0 or 1. 

We fix this by overriding the methods in BitVector. 
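
The failure mode and the shape of the fix can be sketched as follows. The 
names are hypothetical and the hash values are arbitrary choices made for 
this example; this is not the BitVector implementation.

```java
/** Hypothetical sketch: a byte-width-based hash ignores data when the width is 0. */
final class BitHashSketch {
  /** Hashes width bytes starting at start; with width 0 the loop never runs. */
  static int hashBytes(byte[] data, int start, int width) {
    int h = 0;
    for (int i = 0; i < width; i++) {
      h = 31 * h + data[start + i];
    }
    return h;
  }

  static boolean getBit(byte[] buf, int index) {
    return ((buf[index >> 3] >> (index & 7)) & 1) != 0;
  }

  /** Bit-aware hash: consults the validity bit first, then the data bit. */
  static int bitHash(byte[] validity, byte[] data, int index) {
    if (!getBit(validity, index)) {
      return 0; // hash chosen for null slots in this sketch
    }
    return Boolean.hashCode(getBit(data, index));
  }
}
```

Here hashBytes with width 0 returns 0 for any input, while bitHash 
distinguishes a null slot, a 0 bit and a 1 bit.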





[jira] [Created] (ARROW-7955) [Java] Support large buffer for file/stream IPC

2020-02-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-7955:
---

 Summary: [Java] Support large buffer for file/stream IPC
 Key: ARROW-7955
 URL: https://issues.apache.org/jira/browse/ARROW-7955
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


After supporting 64-bit ArrowBuf, we need to make file/stream IPC work.





[jira] [Created] (ARROW-7935) [Java] Remove Netty dependency for BufferAllocator and ReferenceManager

2020-02-25 Thread Liya Fan (Jira)
Liya Fan created ARROW-7935:
---

 Summary: [Java] Remove Netty dependency for BufferAllocator and 
ReferenceManager
 Key: ARROW-7935
 URL: https://issues.apache.org/jira/browse/ARROW-7935
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


With previous work (ARROW-7329 and ARROW-7505), Netty based allocation is only 
one of the possible implementations. So we need to revise BufferAllocator and 
ReferenceManager to make them general and independent of the Netty libraries.





[jira] [Created] (ARROW-7746) [Java] Support large buffer for IPC

2020-02-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-7746:
---

 Summary: [Java] Support large buffer for IPC
 Key: ARROW-7746
 URL: https://issues.apache.org/jira/browse/ARROW-7746
 Project: Apache Arrow
  Issue Type: Task
  Components: Java
Reporter: Liya Fan


The motivation is described in 
https://github.com/apache/arrow/pull/6323#issuecomment-580137629.

When the size of the ArrowBuf exceeds 2GB, our Flight library does not work 
due to integer overflow. 

This is because internally, we have used some data structures which are based 
on 32-bit integers. To resolve the problem, we must revise/replace the data 
structures to make them support 64-bit integers. 

As a concrete example, we can see that when the server sends data through IPC, 
an org.apache.arrow.flight.ArrowMessage object is created, and is wrapped as an 
InputStream through the `asInputStream` method. In this method, we use data 
structures like java.io.ByteArrayOutputStream and io.netty.buffer.ByteBuf, which 
are based on 32-bit integers (we can observe that NettyArrowBuf#length and 
ByteArrayOutputStream#count are both 32-bit integers). 
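
The overflow itself is easy to reproduce in isolation. The sketch below uses 
hypothetical helper names and only standard Java; it shows what happens when 
a 64-bit length is forced into a 32-bit field.

```java
/** Hypothetical sketch of narrowing a 64-bit buffer length to 32 bits. */
final class LengthOverflowSketch {
  /** Silent narrowing: lengths above Integer.MAX_VALUE wrap around. */
  static int narrowUnchecked(long length) {
    return (int) length;
  }

  /** Checked narrowing: throws ArithmeticException instead of wrapping. */
  static int narrowChecked(long length) {
    return Math.toIntExact(length);
  }
}
```

For a 3GB length (3L << 30), narrowUnchecked returns -1073741824, while 
narrowChecked throws an ArithmeticException.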





[jira] [Created] (ARROW-7699) [Java] Support concating dense union vectors in batch

2020-01-28 Thread Liya Fan (Jira)
Liya Fan created ARROW-7699:
---

 Summary: [Java] Support concating dense union vectors in batch
 Key: ARROW-7699
 URL: https://issues.apache.org/jira/browse/ARROW-7699
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


After supporting the dense union vector, we need to support concatenating dense 
union vectors in batch. 





[jira] [Created] (ARROW-7506) [Java] JMH benchmarks should be called from main methods

2020-01-06 Thread Liya Fan (Jira)
Liya Fan created ARROW-7506:
---

 Summary: [Java] JMH benchmarks should be called from main methods
 Key: ARROW-7506
 URL: https://issues.apache.org/jira/browse/ARROW-7506
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Some benchmarks are called as unit tests in our current code base. They should 
be called from main methods, because:

1. This is the recommended way of writing JMH benchmarks. The automatically 
generated benchmarks are called from main, and sample benchmarks provided by 
JMH [1] are also called from main.

2. Some compilers do not support calling JMH as a unit test. For example, 
"javac with error prone" reports the following error:

Error:(100, 15) java: [JUnit4TearDownNotRun] tearDown() method will not be run; 
please add JUnit's @After annotation
(see https://errorprone.info/bugpattern/JUnit4TearDownNotRun)
  Did you mean '@After'?

3. When run as a unit test, assertions are enabled by default, so some 
test/debug operations will be performed. This distorts the benchmark 
results. A related discussion can be found in [2].

[1] 
https://hg.openjdk.java.net/code-tools/jmh/file/tip/jmh-samples/src/main/java/org/openjdk/jmh/samples/
[2] https://github.com/apache/arrow/pull/5842#issuecomment-558082914





[jira] [Created] (ARROW-7505) [Java] Remove Netty dependency for ArrowBuf

2020-01-06 Thread Liya Fan (Jira)
Liya Fan created ARROW-7505:
---

 Summary: [Java] Remove Netty dependency for ArrowBuf
 Key: ARROW-7505
 URL: https://issues.apache.org/jira/browse/ARROW-7505
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is part of the first step of issue ARROW-4526. 
In this step, we remove netty dependency for ArrowBuf, BufferAllocator and 
ReferenceManager. 

In this issue, we remove the dependency for ArrowBuf. 
The task for BufferAllocator and ReferenceManager will not start until 
ARROW-7329 is finished.





[jira] [Created] (ARROW-7491) [Java] Improve the performance of aligning

2020-01-01 Thread Liya Fan (Jira)
Liya Fan created ARROW-7491:
---

 Summary: [Java] Improve the performance of aligning
 Key: ARROW-7491
 URL: https://issues.apache.org/jira/browse/ARROW-7491
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Aligning is an important and frequent operation when writing IPC data. It 
writes no more than seven zero bytes to the output. 
The current implementation creates a new byte array each time, which adds 
performance overhead and increases GC pressure. 

We improve it by means of a shared byte array. Benchmark evaluation shows a 10% 
performance gain. 
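
A minimal sketch of the shared-array idea, assuming 8-byte alignment and 
hypothetical names (AlignSketch, align); it is not the actual write channel 
code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

/** Hypothetical sketch: pad to an 8-byte boundary from one shared zero array. */
final class AlignSketch {
  // Shared and never modified, so it can be reused for every padding write.
  private static final byte[] ZEROS = new byte[8];

  /** Writes 0 to 7 zero bytes so that position reaches a multiple of 8. */
  static int align(OutputStream out, long position) {
    int padding = (int) ((8 - (position & 7)) & 7);
    if (padding > 0) {
      try {
        out.write(ZEROS, 0, padding);
      } catch (IOException e) {
        throw new UncheckedIOException(e);
      }
    }
    return padding;
  }
}
```

Because the array is only ever read, sharing it across calls allocates 
nothing per alignment.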





[jira] [Created] (ARROW-7469) [C++] Improve division related bit operations

2019-12-23 Thread Liya Fan (Jira)
Liya Fan created ARROW-7469:
---

 Summary: [C++] Improve division related bit operations
 Key: ARROW-7469
 URL: https://issues.apache.org/jira/browse/ARROW-7469
 Project: Apache Arrow
  Issue Type: Improvement
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


Improve some operations in bit_util:

1. Eliminate one division for CeilDiv
2. Avoid overflow for RoundUp
3. Add a utility for CeilDiv(value, 8)
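
The three items can be illustrated in Java as below (the issue itself targets 
the C++ bit_util). The formulas are one common way to achieve each goal and 
are an assumption of this sketch, not a copy of the change; all names are 
hypothetical and inputs are assumed to be non-negative with a positive 
divisor.

```java
/** Hypothetical Java rendering of the three bit_util improvements. */
final class BitUtilSketch {
  /** Ceiling division with a single division and no value + divisor overflow. */
  static long ceilDiv(long value, long divisor) {
    return value == 0 ? 0 : 1 + (value - 1) / divisor;
  }

  /** Rounds value up to a multiple of factor without computing value + factor - 1. */
  static long roundUp(long value, long factor) {
    return ceilDiv(value, factor) * factor;
  }

  /** Number of bytes needed for the given number of bits, i.e. ceilDiv(bits, 8). */
  static long bytesForBits(long bits) {
    return (bits >> 3) + ((bits & 7) != 0 ? 1 : 0);
  }
}
```

For example, ceilDiv(Long.MAX_VALUE, 2) yields 1L << 62, whereas the familiar 
(value + divisor - 1) / divisor form would overflow for that input.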





[jira] [Created] (ARROW-7437) [Java] ReadChannel#readFully does not set writer index correctly

2019-12-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-7437:
---

 Summary: [Java] ReadChannel#readFully does not set writer index 
correctly
 Key: ARROW-7437
 URL: https://issues.apache.org/jira/browse/ARROW-7437
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The writer index should be incremented by the amount of data actually read.





[jira] [Created] (ARROW-7429) [Java] Enhance code style checking for Java code (remove consecutive spaces)

2019-12-18 Thread Liya Fan (Jira)
Liya Fan created ARROW-7429:
---

 Summary: [Java] Enhance code style checking for Java code (remove 
consecutive spaces)
 Key: ARROW-7429
 URL: https://issues.apache.org/jira/browse/ARROW-7429
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This issue is opened in response to a discussion in 
https://github.com/apache/arrow/pull/5861#discussion_r348917065.

We found the current style checking for Java code is not sufficient. So we want 
to enhance it in a series of "small" steps, in order to avoid having to change 
too many files at once. 

In this issue, we remove consecutive spaces between tokens, so that tokens are 
separated by single spaces. 





[jira] [Created] (ARROW-7400) [Java] Avoids the worst case for quick sort

2019-12-16 Thread Liya Fan (Jira)
Liya Fan created ARROW-7400:
---

 Summary: [Java] Avoids the worst case for quick sort
 Key: ARROW-7400
 URL: https://issues.apache.org/jira/browse/ARROW-7400
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This issue is in response to a discussion in: 
https://github.com/apache/arrow/pull/5540#discussion_r329487232.

The quick sort algorithm can degenerate to an O(n^2) algorithm if the pivot is 
selected poorly. This is an important problem, as the worst case can happen when 
the input vector is already sorted, which is frequently encountered in practice.

After some investigation, we solve the problem with a simple but effective 
approach: take 3 samples and choose the median (with at most 3 comparisons) as 
the pivot. With this choice, a vector that is already sorted is sorted in 
O(n log n) time. 
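
A sketch of the pivot selection, assuming the three samples are the first, 
middle and last elements of the range (the issue does not say which samples 
are taken). The class and method names are hypothetical.

```java
/** Hypothetical sketch of median-of-three pivot selection. */
final class PivotSketch {
  /**
   * Returns the index (low, middle or high) that holds the median of the
   * three sampled values, using at most 3 comparisons.
   */
  static int medianOfThree(int[] a, int low, int high) {
    int mid = low + (high - low) / 2;
    if (a[low] < a[mid]) {
      if (a[mid] < a[high]) {
        return mid;                        // low < mid < high
      }
      return a[low] < a[high] ? high : low; // high <= mid
    } else {
      if (a[low] < a[high]) {
        return low;                        // mid <= low < high
      }
      return a[mid] < a[high] ? high : mid; // high <= low
    }
  }
}
```

On a sorted range this picks the middle element, so each partition step 
splits the range roughly in half.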





[jira] [Created] (ARROW-7349) [C++] Fix the bug of parsing string hex values

2019-12-09 Thread Liya Fan (Jira)
Liya Fan created ARROW-7349:
---

 Summary: [C++] Fix the bug of parsing string hex values
 Key: ARROW-7349
 URL: https://issues.apache.org/jira/browse/ARROW-7349
 Project: Apache Arrow
  Issue Type: Bug
  Components: C++
Reporter: Liya Fan
Assignee: Liya Fan


std::lower_bound returns the end of the search range, when failing to find a 
match. 

The end of the search range is one position after the last valid position. So 
the value in this position is undefined, and we should not reference the value 
here to compare it with the target value. 
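
The required guard can be shown with a Java analogue of std::lower_bound. 
The helper names are hypothetical. Note that in Java an out-of-range read 
throws an exception rather than being undefined, but the fix has the same 
shape: check the returned position before reading from it.

```java
/** Hypothetical Java analogue of the lower_bound end-of-range check. */
final class LowerBoundSketch {
  /** Returns the first index whose value is >= target, or sorted.length if none. */
  static int lowerBound(char[] sorted, char target) {
    int lo = 0;
    int hi = sorted.length;
    while (lo < hi) {
      int mid = (lo + hi) >>> 1;
      if (sorted[mid] < target) {
        lo = mid + 1;
      } else {
        hi = mid;
      }
    }
    return lo;
  }

  /** Reads sorted[pos] only after confirming that pos is inside the range. */
  static boolean contains(char[] sorted, char target) {
    int pos = lowerBound(sorted, target);
    return pos < sorted.length && sorted[pos] == target;
  }
}
```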





[jira] [Created] (ARROW-7301) [Java] Sql type DATE should correspond to DateDayVector

2019-12-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-7301:
---

 Summary: [Java] Sql type DATE should correspond to DateDayVector
 Key: ARROW-7301
 URL: https://issues.apache.org/jira/browse/ARROW-7301
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to SQL convention, the SQL type DATE should correspond to a format of 
YYYY-MM-DD, without the components for hour/minute/second/millis.

Therefore, JDBC type DATE should correspond to DateDayVector, with a type width 
of 4, instead of 8. 





[jira] [Created] (ARROW-7277) [Document] Add discussion about vector lifecycle

2019-11-28 Thread Liya Fan (Jira)
Liya Fan created ARROW-7277:
---

 Summary: [Document] Add discussion about vector lifecycle
 Key: ARROW-7277
 URL: https://issues.apache.org/jira/browse/ARROW-7277
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As discussed in 
https://issues.apache.org/jira/browse/ARROW-7254?focusedCommentId=16983284=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-16983284,
 we need a discussion about the lifecycle of a vector.

Each vector has a lifecycle, and different operations should be performed in 
particular phases of the lifecycle. If we violate this, some unexpected results 
may be produced. This may cause some confusion for Arrow users. So we want to 
add a new section to the prose document, to make it clear and explicit.





[jira] [Created] (ARROW-7216) [Java] Improve the performance of setting/clearing individual bits

2019-11-20 Thread Liya Fan (Jira)
Liya Fan created ARROW-7216:
---

 Summary: [Java] Improve the performance of setting/clearing 
individual bits
 Key: ARROW-7216
 URL: https://issues.apache.org/jira/browse/ARROW-7216
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Setting and clearing individual bits are key operations for Arrow. In this 
issue, we improve the performance of these operations by:

1. replacing arithmetic operations with bit-wise operations
2. removing unnecessary casts between int/byte
3. providing a new API to remove the if branch
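
The three points can be sketched on a plain byte array as follows. The class 
and method names are hypothetical, and the code works on byte[] rather than 
on an ArrowBuf.

```java
/** Hypothetical sketch of bit-wise set/clear helpers on a byte array. */
final class BitOpsSketch {
  /** Sets the bit: index >> 3 replaces index / 8, index & 7 replaces index % 8. */
  static void setBit(byte[] buf, int index) {
    buf[index >> 3] |= 1 << (index & 7);
  }

  /** Dedicated clear operation, so callers that always clear need no branch. */
  static void clearBit(byte[] buf, int index) {
    buf[index >> 3] &= ~(1 << (index & 7));
  }

  /** General form that still needs a branch on the value. */
  static void setBit(byte[] buf, int index, int value) {
    if (value != 0) {
      setBit(buf, index);
    } else {
      clearBit(buf, index);
    }
  }

  static boolean getBit(byte[] buf, int index) {
    return ((buf[index >> 3] >> (index & 7)) & 1) != 0;
  }
}
```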

Benchmark results show that for clearing a bit, the performance improves by 11%, 
and for the general set/clear operation, the performance improves by 4.7%:

before:
BitVectorHelperBenchmarks.setValidityBitBenchmark        avgt    5  4.524 ± 0.015  us/op

after:
BitVectorHelperBenchmarks.setValidityBitBenchmark        avgt    5  4.313 ± 0.011  us/op
BitVectorHelperBenchmarks.setValidityBitToZeroBenchmark  avgt    5  4.020 ± 0.016  us/op







[jira] [Created] (ARROW-7213) [Java] Represent a data element of a vector as a tree of ArrowBufPointer

2019-11-19 Thread Liya Fan (Jira)
Liya Fan created ARROW-7213:
---

 Summary: [Java] Represent a data element of a vector as a tree of 
ArrowBufPointer
 Key: ARROW-7213
 URL: https://issues.apache.org/jira/browse/ARROW-7213
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For a fixed/variable width vector, each of its data elements can be represented 
as an ArrowBufPointer object, which represents a contiguous memory segment. 
This makes many tasks easier and more efficient (without memory copy): 
calculating hash code, comparing values, etc.

This cannot be achieved for complex vectors, because their values often reside 
in more than one contiguous memory region. However, it can be seen that the 
contiguous memory regions for each data element form a tree-like structure, 
whose leaf nodes are the contiguous memory regions. For example, a data element 
of a struct vector forms a tree, whose root corresponds to the struct vector, 
while the child vectors correspond to the child nodes of the tree root. 

In this issue, we provide a data structure that represents each data element of 
a vector as a tree, whose leaf nodes are ArrowBufPointers, representing 
contiguous memory regions for the data element. 

With this data structure, many tasks also become easier and more efficient: 
calculating hash code, comparing vector elements (ordering & equality). In 
addition, we can do things that could not be done in the past, like 
placing data elements into a hash table/hash set, etc. 






[jira] [Created] (ARROW-7177) [Java] Provide a utility to improve the performance of vector loading/unloading

2019-11-15 Thread Liya Fan (Jira)
Liya Fan created ARROW-7177:
---

 Summary: [Java] Provide a utility to improve the performance of 
vector loading/unloading
 Key: ARROW-7177
 URL: https://issues.apache.org/jira/browse/ARROW-7177
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Vector loading/unloading transforms a set of vectors to and from a set of 
buffers with metadata. It is heavily used in flight/IPC. 

In the loading/unloading operations, only the number of type buffers is really 
needed. However, the current code logic gets a copy of the type buffers, which 
is not necessary.

In this issue, we provide a utility to get the number of type buffers, given an 
arrow type. It improves the performance by 

1. avoiding creating objects unnecessarily.
2. avoiding list copying for vector unloading (which calls 
TypeLayout#getBufferTypes).





[jira] [Created] (ARROW-7166) [Java] Remove redundant code for Jdbc adapters

2019-11-13 Thread Liya Fan (Jira)
Liya Fan created ARROW-7166:
---

 Summary: [Java] Remove redundant code for Jdbc adapters
 Key: ARROW-7166
 URL: https://issues.apache.org/jira/browse/ARROW-7166
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As discussed in 
https://github.com/apache/arrow/pull/5508#issuecomment-543011016, we need a 
separate issue to extract common logic to a common super class. 

This makes the code clearer, and we need to make sure we have no performance 
regression. 






[jira] [Created] (ARROW-7106) [Java] Fix the problem that flight perf test hangs endlessly

2019-11-10 Thread Liya Fan (Jira)
Liya Fan created ARROW-7106:
---

 Summary: [Java] Fix the problem that flight perf test hangs 
endlessly
 Key: ARROW-7106
 URL: https://issues.apache.org/jira/browse/ARROW-7106
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Flight performance test (org.apache.arrow.flight.perf.TestPerf) is an important 
tool for tracking the current throughput of IPC. In this issue, we improve it 
in two ways:

1. We fix the problem that the test hangs endlessly after all runs have been 
finished. This is because the thread pool is not released.

2. We add a summary to the output report, so that we can easily evaluate the 
overall results for all runs. 
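
The hang in item 1 is the usual consequence of leaving a pool of non-daemon 
worker threads running. A minimal sketch of releasing the pool once all runs 
are done, using only java.util.concurrent; the class name, pool size and 
timeout are hypothetical and unrelated to the actual TestPerf code.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: always release the thread pool after the last run. */
final class PoolShutdownSketch {
  /** Executes the given number of runs and returns how many completed. */
  static int runAll(int runs) {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    AtomicInteger completed = new AtomicInteger();
    Runnable task = completed::incrementAndGet;
    try {
      for (int i = 0; i < runs; i++) {
        pool.execute(task);
      }
    } finally {
      // Without this call the idle worker threads keep the JVM alive.
      pool.shutdown();
    }
    try {
      if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
        pool.shutdownNow();
      }
    } catch (InterruptedException e) {
      pool.shutdownNow();
      Thread.currentThread().interrupt();
    }
    return completed.get();
  }
}
```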






[jira] [Created] (ARROW-7073) [Java] Support concating vectors values in batch

2019-11-06 Thread Liya Fan (Jira)
Liya Fan created ARROW-7073:
---

 Summary: [Java] Support concating vectors values in batch
 Key: ARROW-7073
 URL: https://issues.apache.org/jira/browse/ARROW-7073
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need a way to copy vector values in batch. Currently, we have copyFrom and 
copyFromSafe APIs. However, they are not enough, as copying values individually 
is not performant. 





[jira] [Created] (ARROW-7072) [Java] Support concating validity bits efficiently

2019-11-06 Thread Liya Fan (Jira)
Liya Fan created ARROW-7072:
---

 Summary: [Java] Support concating validity bits efficiently
 Key: ARROW-7072
 URL: https://issues.apache.org/jira/browse/ARROW-7072
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For scenarios where we need to concatenate vectors (like the scenario in 
ARROW-7048, and delta dictionary), we need a way to concatenate validity bits. 

Currently, we have a bit-level API to read/write individual validity bits. 
However, it is not efficient, and we need a way to copy more bits at a time. 
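
One way to copy more than one bit at a time is to move whole bytes and shift 
them when the destination is not byte-aligned. The sketch below works on 
plain byte arrays with least-significant-bit-first numbering; the names are 
hypothetical and this is not the API added by the issue.

```java
/** Hypothetical sketch: concatenate two validity bitmaps byte by byte. */
final class ValidityConcatSketch {
  /** Returns a bitmap holding the first aBits of a followed by the first bBits of b. */
  static byte[] concat(byte[] a, int aBits, byte[] b, int bBits) {
    int total = aBits + bBits;
    byte[] out = new byte[(total + 7) >> 3];
    int aBytes = (aBits + 7) >> 3;
    int bBytes = (bBits + 7) >> 3;
    System.arraycopy(a, 0, out, 0, aBytes);
    int shift = aBits & 7;
    if (shift == 0) {
      // Destination is byte-aligned: a plain bulk copy is enough.
      System.arraycopy(b, 0, out, aBytes, bBytes);
    } else {
      int pos = aBytes - 1;           // last, partially filled byte of a
      out[pos] &= (1 << shift) - 1;   // drop any bits of a past aBits
      for (int i = 0; i < bBytes; i++) {
        int v = b[i] & 0xFF;
        out[pos + i] |= v << shift;   // low bits of v fill the current byte
        if (pos + i + 1 < out.length) {
          out[pos + i + 1] = (byte) (v >>> (8 - shift)); // the rest spills over
        }
      }
    }
    int tail = total & 7;
    if (tail != 0) {
      out[out.length - 1] &= (1 << tail) - 1; // clear padding after the last bit
    }
    return out;
  }
}
```

Each loop iteration moves eight bits, compared with one bit per call for a 
bit-level API.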





[jira] [Created] (ARROW-7020) [Java] Fix the bugs when calculating vector hash code

2019-10-29 Thread Liya Fan (Jira)
Liya Fan created ARROW-7020:
---

 Summary: [Java] Fix the bugs when calculating vector hash code
 Key: ARROW-7020
 URL: https://issues.apache.org/jira/browse/ARROW-7020
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When calculating the hash code for a value in the vector, the validity bit must 
be taken into account.





[jira] [Created] (ARROW-7019) [Java] Improve the performance of loading validity buffers

2019-10-29 Thread Liya Fan (Jira)
Liya Fan created ARROW-7019:
---

 Summary: [Java] Improve the performance of loading validity buffers
 Key: ARROW-7019
 URL: https://issues.apache.org/jira/browse/ARROW-7019
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


On the receiver side of Flight, loading the validity buffer is an important 
operation, as each vector has a validity buffer. 

For non-nullable vectors, the current implementation of loading the validity 
buffer is inefficient.  We improve the performance of this operation by 
efficiently setting the bits of a memory region to 1. 

Benchmark results show that the change leads to a 35% performance improvement:

Before:
BitVectorHelperBenchmarks.loadValidityBufferAllOne  avgt  5  748.916 ± 23.290  ns/op

After:
BitVectorHelperBenchmarks.loadValidityBufferAllOne  avgt  5  487.352 ± 15.046  ns/op
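
The idea of setting all bits of a memory region to 1 in bulk can be sketched 
as follows. This is a minimal illustration on a heap byte array, assuming one 
validity bit per value; the class and method names are hypothetical, and the 
actual change operates on Arrow buffers.

```java
import java.util.Arrays;

final class ValidityFill {
    /** Builds a validity bitmap in which the first valueCount bits are all 1. */
    static byte[] allValid(int valueCount) {
        byte[] bitmap = new byte[(valueCount + 7) / 8];
        int fullBytes = valueCount / 8;
        // Set whole bytes in one bulk operation instead of bit by bit.
        Arrays.fill(bitmap, 0, fullBytes, (byte) 0xFF);
        int remainder = valueCount % 8;
        if (remainder != 0) {
            // Only the low 'remainder' bits of the last byte are valid.
            bitmap[fullBytes] = (byte) ((1 << remainder) - 1);
        }
        return bitmap;
    }
}
```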




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6935) [Java] Improve the performance of comparing two blocks of heap data

2019-10-18 Thread Liya Fan (Jira)
Liya Fan created ARROW-6935:
---

 Summary: [Java] Improve the performance of comparing two blocks of 
heap data
 Key: ARROW-6935
 URL: https://issues.apache.org/jira/browse/ARROW-6935
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Implement methods to compare data word by word, instead of byte by byte.
Benchmarks show that there is a 4.5x performance improvement:

ByteFunctionHelpersBenchmarks.builtInByteArrayEquals  avgt  5  437.504 ± 1.120  ns/op
ByteFunctionHelpersBenchmarks.byteArrayEquals         avgt  5   97.700 ± 0.178  ns/op
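
A minimal sketch of word-by-word comparison is shown below. It uses 
ByteBuffer.getLong to read 8 bytes at a time and falls back to single bytes 
for the tail. The class and method names are hypothetical; the actual Arrow 
implementation is not shown in this message and may access memory differently.

```java
import java.nio.ByteBuffer;

final class WordCompare {
    /** Returns true if both arrays have the same length and content. */
    static boolean equalsByWords(byte[] a, byte[] b) {
        if (a.length != b.length) {
            return false;
        }
        ByteBuffer x = ByteBuffer.wrap(a);
        ByteBuffer y = ByteBuffer.wrap(b);
        int i = 0;
        // Compare 8 bytes per iteration.
        for (; i + 8 <= a.length; i += 8) {
            if (x.getLong(i) != y.getLong(i)) {
                return false;
            }
        }
        // Compare the remaining (fewer than 8) bytes one by one.
        for (; i < a.length; i++) {
            if (a[i] != b[i]) {
                return false;
            }
        }
        return true;
    }
}
```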



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6933) [Java] Support linear dictionary encoder

2019-10-18 Thread Liya Fan (Jira)
Liya Fan created ARROW-6933:
---

 Summary: [Java] Support linear dictionary encoder
 Key: ARROW-6933
 URL: https://issues.apache.org/jira/browse/ARROW-6933
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For many scenarios, the distribution of dictionary entries is highly skewed. In 
other words, a few dictionary entries occur much more frequently than others. 
If we sort the dictionary in non-increasing order of entry frequency, 
and compare each value to encode starting from the beginning of the dictionary, 
we get the following benefits:

1)  We need no extra memory space or data structure.
2)  The search is extremely efficient, as we are likely to find a match in 
the first few entries of the dictionary.

This is the basic idea behind the linear dictionary encoder. When the scenario 
is right (highly skewed dictionary distribution), it outperforms both search 
based encoders and hash table based encoders. 
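
The linear scan described above can be sketched as follows. The sketch uses 
String arrays in place of Arrow vectors and assumes the dictionary has already 
been sorted by non-increasing frequency; the class and method names are 
hypothetical.

```java
final class LinearDictionaryEncoder {
    /**
     * Scans the dictionary from the front and returns the index of the first
     * entry equal to value, or -1 if the value is not in the dictionary.
     */
    static int encode(String[] dictionary, String value) {
        for (int i = 0; i < dictionary.length; i++) {
            if (dictionary[i].equals(value)) {
                return i;
            }
        }
        return -1;
    }

    /** Encodes every value; frequent values match within the first few entries. */
    static int[] encodeAll(String[] dictionary, String[] values) {
        int[] encoded = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = encode(dictionary, values[i]);
        }
        return encoded;
    }
}
```

No auxiliary data structure is built; the cost per value is proportional to 
the position of its entry in the dictionary.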




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6911) [Java] Provide composite comparator

2019-10-16 Thread Liya Fan (Jira)
Liya Fan created ARROW-6911:
---

 Summary: [Java] Provide composite comparator
 Key: ARROW-6911
 URL: https://issues.apache.org/jira/browse/ARROW-6911
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


A composite comparator is a sub-class of VectorValueComparator that contains an 
array of inner comparators, with each comparator corresponding to one column 
for comparison. It can be used to support sort/comparison operations for 
VectorSchemaRoot/StructVector.

The composite comparator works like this: it first uses the first internal 
comparator (for the primary sort key) to compare vector values. If the result is 
non-zero, it is returned; otherwise, the second comparator is used to 
break the tie, and so on, until a non-zero value is produced by some internal 
comparator, or all internal comparators have been used. 
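
The tie-breaking loop can be sketched as below. IndexComparator is a 
hypothetical stand-in for VectorValueComparator that compares two row indices; 
the real class has a richer API that is not reproduced here.

```java
/** Hypothetical stand-in: compares the values at two row indices of one column. */
interface IndexComparator {
    int compare(int index1, int index2);
}

final class CompositeComparator implements IndexComparator {
    private final IndexComparator[] innerComparators;

    CompositeComparator(IndexComparator... innerComparators) {
        this.innerComparators = innerComparators;
    }

    @Override
    public int compare(int index1, int index2) {
        // Try each column in priority order; the first non-zero result wins.
        for (IndexComparator comparator : innerComparators) {
            int result = comparator.compare(index1, index2);
            if (result != 0) {
                return result;
            }
        }
        // All columns compare equal.
        return 0;
    }
}
```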




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6896) [Java] Vector schema root should not share vectors

2019-10-15 Thread Liya Fan (Jira)
Liya Fan created ARROW-6896:
---

 Summary: [Java] Vector schema root should not share vectors
 Key: ARROW-6896
 URL: https://issues.apache.org/jira/browse/ARROW-6896
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Vector schema root should not share vectors. Otherwise, unexpected behavior 
would happen. 

Please note that VectorSchemaRoot is not just a container for vectors; it is 
also a resource (it implements the AutoCloseable interface), and it manages the 
life cycle of its inner vectors.

When two VectorSchemaRoots share vectors, something unexpected may happen. 
Consider the following scenario, which is frequently encountered in a SQL 
engine.

1. We create a batch:
VectorSchemaRoot oldBatch = ...

2. We add a vector to it, which results in a new batch
VectorSchemaRoot newBatch = oldBatch.addVector(vector);

3. We are done with the old batch, and release the resource
oldBatch.close();

4. We continue to use the new batch, but get an exception, because some inner 
vectors have been released by the old batch. 





--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6888) [Java] Support copy operation for vector value comparators

2019-10-15 Thread Liya Fan (Jira)
Liya Fan created ARROW-6888:
---

 Summary: [Java] Support copy operation for vector value comparators
 Key: ARROW-6888
 URL: https://issues.apache.org/jira/browse/ARROW-6888
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In this issue, we provide copy operations for vector value comparators. This 
operation creates another comparator with the same type and comparison logic.

This feature is useful in multi-threading scenarios where multiple threads use 
the comparator to perform their own tasks. In this scenario, we have no way of 
making sure the compare method is thread safe. So a safe way is to create a new 
comparator for each thread. The copy operation will support this.

An immediate application of this is the parallel searcher for ordering 
semantics. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6866) [Java] Improve the performance of calculating hash code for struct vector

2019-10-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-6866:
---

 Summary: [Java] Improve the performance of calculating hash code 
for struct vector
 Key: ARROW-6866
 URL: https://issues.apache.org/jira/browse/ARROW-6866
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Improve the performance of hashCode(int) method for StructVector:
1. We can get the child vectors directly, so there is no need to get the name 
from the child vector and then use the name to get the vector. 
2. The child vectors cannot be null, so there is no need to check it.

The performance improvement depends on the complexity of the hash algorithm. 
For computationally intensive hash algorithms, the improvement can be small, 
while for simple hash algorithms, the improvement can be notable. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6865) [Java] Improve the performance of comparing an ArrowBuf against a byte array

2019-10-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-6865:
---

 Summary: [Java] Improve the performance of comparing an ArrowBuf 
against a byte array
 Key: ARROW-6865
 URL: https://issues.apache.org/jira/browse/ARROW-6865
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We change the way of comparing an ArrowBuf against a byte array from byte-wise 
comparison to comparison by long/int/byte.

The benchmark shows that there is a 6.7x performance improvement. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6863) [Java] Provide parallel searcher

2019-10-12 Thread Liya Fan (Jira)
Liya Fan created ARROW-6863:
---

 Summary: [Java] Provide parallel searcher
 Key: ARROW-6863
 URL: https://issues.apache.org/jira/browse/ARROW-6863
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For scenarios where the vector is large and a low response time is 
required, we need to search the vector in parallel to improve the 
responsiveness.

This issue tries to provide a parallel searcher for the equality semantics (the 
support for ordering semantics is not ready yet, as we need a way to distribute 
the comparator).

The implementation is based on multi-threading.
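
A minimal sketch of a multi-threaded equality search is given below. It 
searches an int array rather than an Arrow vector, splits the range into one 
chunk per thread, and uses the standard ExecutorService API. The class and 
method names are hypothetical and this is not the searcher proposed in the 
issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelSearch {
    /** Returns the smallest index i with vector[i] == key, or -1 if there is none. */
    static int search(int[] vector, int key, int numThreads) {
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            int chunk = (vector.length + numThreads - 1) / numThreads;
            for (int t = 0; t < numThreads; t++) {
                final int start = t * chunk;
                final int end = Math.min(vector.length, start + chunk);
                // Each task scans its own chunk and reports the first match in it.
                results.add(pool.submit(() -> {
                    for (int i = start; i < end; i++) {
                        if (vector[i] == key) {
                            return i;
                        }
                    }
                    return -1;
                }));
            }
            // Chunks are inspected in index order, so the first hit is the smallest index.
            for (Future<Integer> result : results) {
                int index = result.get();
                if (index >= 0) {
                    return index;
                }
            }
            return -1;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Equality search needs no shared comparator state, which is why it is easier to 
parallelize than a search with ordering semantics.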



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6738) [Java] Fix problems with current union comparison logic

2019-09-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-6738:
---

 Summary: [Java] Fix problems with current union comparison logic
 Key: ARROW-6738
 URL: https://issues.apache.org/jira/browse/ARROW-6738
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


There are some problems with the current union comparison logic. For example:
1. For type check, we should not require fields to be equal. It is possible 
that two vectors' value ranges are equal but their fields are different.
2. We should not compare the number of sub vectors, as it is possible that two 
union vectors have different numbers of sub vectors, but have equal values in 
the range.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6732) [Java] Implement quick sort in a non-recursive way to avoid stack overflow

2019-09-29 Thread Liya Fan (Jira)
Liya Fan created ARROW-6732:
---

 Summary: [Java] Implement quick sort in a non-recursive way to 
avoid stack overflow
 Key: ARROW-6732
 URL: https://issues.apache.org/jira/browse/ARROW-6732
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current quick sort algorithm is implemented as a recursive algorithm. The 
problem is that in the worst case, the number of recursive layers is equal to 
the length of the vector.  For large vectors, this will cause stack overflow.

To solve this problem, we implement the quick sort algorithm as a non-recursive 
algorithm.
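
One common way to remove the recursion is to keep the pending ranges on an 
explicit stack that lives on the heap. The sketch below does this for an int 
array with a simple last-element pivot; the class name is hypothetical and the 
partition scheme is an assumption, not necessarily the one used in Arrow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class IterativeQuickSort {
    /** Sorts the array in ascending order without using recursion. */
    static void sort(int[] a) {
        // Each stack entry is an inclusive range {low, high} still to be sorted.
        Deque<int[]> pending = new ArrayDeque<>();
        pending.push(new int[] {0, a.length - 1});
        while (!pending.isEmpty()) {
            int[] range = pending.pop();
            int low = range[0];
            int high = range[1];
            if (low >= high) {
                continue; // zero or one element: already sorted
            }
            int p = partition(a, low, high);
            pending.push(new int[] {low, p - 1});
            pending.push(new int[] {p + 1, high});
        }
    }

    /** Lomuto partition: places a[high] at its final position and returns it. */
    private static int partition(int[] a, int low, int high) {
        int pivot = a[high];
        int i = low;
        for (int j = low; j < high; j++) {
            if (a[j] < pivot) {
                swap(a, i, j);
                i++;
            }
        }
        swap(a, i, high);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```

In the worst case the explicit stack still grows with the vector length, but 
it is limited by heap size rather than by the thread's call stack.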



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6723) [Java] Reduce the range of synchronized block when releasing an ArrowBuf

2019-09-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-6723:
---

 Summary: [Java] Reduce the range of synchronized block when 
releasing an ArrowBuf
 Key: ARROW-6723
 URL: https://issues.apache.org/jira/browse/ARROW-6723
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When releasing an ArrowBuf, we will run the following piece of code:

  private int decrement(int decrement) {
    allocator.assertOpen();
    final int outcome;
    synchronized (allocationManager) {
      outcome = bufRefCnt.addAndGet(-decrement);
      if (outcome == 0) {
        lDestructionTime = System.nanoTime();
        allocationManager.release(this);
      }
    }
    return outcome;
  }

It can be seen that we need to acquire the allocation manager lock whether or 
not we need to release the buffer. In addition, the operation of 
decrementing the refcount is only carried out after the lock is acquired. This 
leads to unnecessary resource contention, and may degrade performance. 

We propose to change the code like this:

  private int decrement(int decrement) {
    allocator.assertOpen();
    final int outcome;
    outcome = bufRefCnt.addAndGet(-decrement);
    if (outcome == 0) {
      lDestructionTime = System.nanoTime();
      synchronized (allocationManager) {
        allocationManager.release(this);
      }
    }
    return outcome;
  }

Note that this change can be dangerous, as it lies in the core of our code 
base, so we should be careful with it. On the other hand, it may have 
non-trivial performance implications. As far as I know, when a distributed task 
is getting closed, a large number of ArrowBufs will be closed simultaneously. If 
we reduce the range of the synchronization block, we can significantly improve 
the performance. 

What do you think?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6722) [Java] Provide a uniform way to get vector name

2019-09-27 Thread Liya Fan (Jira)
Liya Fan created ARROW-6722:
---

 Summary: [Java] Provide a uniform way to get vector name
 Key: ARROW-6722
 URL: https://issues.apache.org/jira/browse/ARROW-6722
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Currently, the getName method is defined in BaseValueVector, an abstract 
class. However, some vectors do not extend BaseValueVector, like 
StructVector, UnionVector, and ZeroVector.
In this issue, we move the method to the ValueVector interface, the base 
interface for all vectors.
This makes it easier to get a vector's name without checking its type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6672) [Java] Extract a common interface for dictionary builders

2019-09-24 Thread Liya Fan (Jira)
Liya Fan created ARROW-6672:
---

 Summary: [Java] Extract a common interface for dictionary builders
 Key: ARROW-6672
 URL: https://issues.apache.org/jira/browse/ARROW-6672
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need a common interface for dictionary builders to support more 
sophisticated scenarios, like collecting dictionary statistics.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6598) [Java] Sort the code for ApproxEqualsVisitor

2019-09-18 Thread Liya Fan (Jira)
Liya Fan created ARROW-6598:
---

 Summary: [Java] Sort the code for ApproxEqualsVisitor
 Key: ARROW-6598
 URL: https://issues.apache.org/jira/browse/ARROW-6598
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As a follow-up issue of ARROW-6458, we finalize the code for 
ApproxEqualsVisitor.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (ARROW-6580) [Java] Support comparison for unsigned integers

2019-09-17 Thread Liya Fan (Jira)
Liya Fan created ARROW-6580:
---

 Summary: [Java] Support comparison for unsigned integers
 Key: ARROW-6580
 URL: https://issues.apache.org/jira/browse/ARROW-6580
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In this issue, we support the comparison of unsigned integer vectors, including 
UInt1Vector, UInt2Vector, UInt4Vector, and UInt8Vector.
With support for comparison for these vectors, the sort for them is also 
supported automatically.
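
Java has no unsigned primitive types, so unsigned comparison has to be done 
explicitly. The sketch below shows one way to do it for 1-, 2-, 4- and 8-byte 
values using masking and the standard Integer.compareUnsigned and 
Long.compareUnsigned methods (available since Java 8). The class and method 
names are hypothetical, and the primitive types chosen for each width are an 
assumption.

```java
final class UnsignedCompare {
    /** Compares two 1-byte values as unsigned integers in [0, 255]. */
    static int compareUInt1(byte a, byte b) {
        return Integer.compare(a & 0xFF, b & 0xFF);
    }

    /** Compares two 2-byte values as unsigned integers in [0, 65535]. */
    static int compareUInt2(short a, short b) {
        return Integer.compare(a & 0xFFFF, b & 0xFFFF);
    }

    /** Compares two 4-byte values as unsigned integers. */
    static int compareUInt4(int a, int b) {
        return Integer.compareUnsigned(a, b);
    }

    /** Compares two 8-byte values as unsigned integers. */
    static int compareUInt8(long a, long b) {
        return Long.compareUnsigned(a, b);
    }
}
```

With a signed comparison, the bit pattern 0xFF would sort before 1; with the 
methods above it sorts after, as expected for unsigned data.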



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6458) [Java] Improve the performance and code structure for ApproxEqualsVisitor

2019-09-04 Thread Liya Fan (Jira)
Liya Fan created ARROW-6458:
---

 Summary: [Java] Improve the performance and code structure for 
ApproxEqualsVisitor
 Key: ARROW-6458
 URL: https://issues.apache.org/jira/browse/ARROW-6458
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As discussed in 
https://github.com/apache/arrow/pull/5195#issuecomment-526157961, there are 
some problems with the current way of comparing floating point vectors. We 
solve them in this PR:

1. There are if statements/duplicated members in ApproxEqualsVisitor, making 
the code redundant and less clear.
2. The comparison of float4 and float8 values is based on the wrapper objects 
Float and Double, which may incur a performance penalty.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6420) [Java] Improve the performance of UnionVector when getting underlying vectors

2019-09-02 Thread Liya Fan (Jira)
Liya Fan created ARROW-6420:
---

 Summary: [Java] Improve the performance of UnionVector when 
getting underlying vectors
 Key: ARROW-6420
 URL: https://issues.apache.org/jira/browse/ARROW-6420
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Getting the underlying vector is a frequent operation for UnionVector. It 
relies on this operation to get/set data at each index.

The current implementation is inefficient. In particular, it first gets the 
minor type at the given index, and then compares it against all possible minor 
types in a switch statement, until a match is found.

We improve the performance by storing the internal vectors in an array, whose 
index is the ordinal of the minor type. So given a minor type, its 
corresponding underlying vector can be obtained in O(1) time.

It should be noted that this technique is also applicable to UnionReader and 
UnionWriter, and support for UnionReader is already implemented.
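
The ordinal-indexed lookup can be sketched as follows. The enum here is a 
small hypothetical stand-in for Arrow's minor types, and child vectors are 
represented as plain Objects to keep the example self-contained.

```java
final class UnionChildLookup {
    /** Hypothetical, reduced set of minor types for illustration only. */
    enum MinorType { INT, FLOAT8, VARCHAR }

    // One slot per minor type; the slot index is the enum ordinal.
    private final Object[] childByOrdinal = new Object[MinorType.values().length];

    void setChild(MinorType type, Object vector) {
        childByOrdinal[type.ordinal()] = vector;
    }

    /** Constant-time lookup: a single array access, no switch over all types. */
    Object getChild(MinorType type) {
        return childByOrdinal[type.ordinal()];
    }
}
```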



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6394) [Java] Support conversions between delta vector and partial sum vector

2019-08-30 Thread Liya Fan (Jira)
Liya Fan created ARROW-6394:
---

 Summary: [Java] Support conversions between delta vector and 
partial sum vector
 Key: ARROW-6394
 URL: https://issues.apache.org/jira/browse/ARROW-6394
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


What is a delta vector/partial sum vector?

Given an integer vector a with length n, its partial sum vector is another 
integer vector b with length n + 1, with values defined as:

b(0) = initial sum
b(i) = b(0) + a(0) + a(1) + ... + a(i - 1), i = 1, 2, ..., n

Given an integer vector a with length n + 1, its delta vector is another integer 
vector b with length n, with values defined as:

b(i) = a(i + 1) - a(i), i = 0, 1, ..., n - 1

In this issue, we provide utilities to convert between delta vectors and 
partial sum vectors. It is interesting to note that the two operations 
correspond to discrete integration and differentiation.

These conversions have wide applications. For example,

1. The run-length vector proposed by Micah is based on the partial sum vector, 
while the deduplication functionality is based on delta vector. This issue 
provides conversions between them.

2. The current VarCharVector/VarBinaryVector implementations are based on 
partial sum vector. We can transform them to delta vectors before IPC, to 
reduce network traffic.

3. Converting to delta vectors can be considered a form of data compression. To 
further reduce the data volume, the operation can be applied more than once.
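
The two conversions can be sketched on int arrays as shown below. The sketch 
assumes the convention that a partial sum vector of length n + 1 starts with 
the initial sum, and that delta element i is the difference between partial 
sums i + 1 and i, so the two operations are inverses of each other. The class 
and method names are hypothetical.

```java
final class DeltaPartialSum {
    /** Converts a delta vector of length n to a partial sum vector of length n + 1. */
    static int[] toPartialSum(int[] deltas, int initialSum) {
        int[] sums = new int[deltas.length + 1];
        sums[0] = initialSum;
        for (int i = 0; i < deltas.length; i++) {
            sums[i + 1] = sums[i] + deltas[i];
        }
        return sums;
    }

    /** Converts a partial sum vector of length n + 1 (n >= 0) to a delta vector of length n. */
    static int[] toDelta(int[] sums) {
        int[] deltas = new int[sums.length - 1];
        for (int i = 0; i < deltas.length; i++) {
            deltas[i] = sums[i + 1] - sums[i];
        }
        return deltas;
    }
}
```

For example, the value lengths 3, 0, 5 of a variable-width vector correspond to 
the offsets 0, 3, 3, 8, and converting the offsets back yields the lengths.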



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6374) [Java] Refactor the code for TimeXXVectors

2019-08-28 Thread Liya Fan (Jira)
Liya Fan created ARROW-6374:
---

 Summary: [Java] Refactor the code for TimeXXVectors
 Key: ARROW-6374
 URL: https://issues.apache.org/jira/browse/ARROW-6374
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is based on the discussion at:
https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E

The internals of TimeXXVectors are simply IntVector or BigIntVector. There is 
duplicated code for setting/getting int/long.

We want to refactor the code by:
1. Pushing the get/set methods into the base class BaseFixedWidthVector, and 
making them protected.
2. Making the APIs in TimeXXVectors reference the methods in the base class.

Note that this issue does not just reduce redundant code; it also centralizes 
the logic for getting/setting int/long, making it easy to maintain and change.

If it looks good, later we will make other integer based vectors rely on the 
base class implementations. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6366) [Java] Make field vectors final explicitly

2019-08-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-6366:
---

 Summary: [Java] Make field vectors final explicitly
 Key: ARROW-6366
 URL: https://issues.apache.org/jira/browse/ARROW-6366
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to the discussion in 
https://lists.apache.org/thread.html/836d3b87ccb6e65e9edf0f220829a29edfa394fc2cd1e0866007d86e@%3Cdev.arrow.apache.org%3E,
field vectors should not be extended, so they should be made final explicitly.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6355) [Java] Make range equal visitor reusable

2019-08-26 Thread Liya Fan (Jira)
Liya Fan created ARROW-6355:
---

 Summary: [Java] Make range equal visitor reusable
 Key: ARROW-6355
 URL: https://issues.apache.org/jira/browse/ARROW-6355
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to the discussion in 
https://github.com/apache/arrow/pull/4993#discussion_r316009165, we often 
encounter this scenario: we compare values repeatedly, and the comparisons 
differ only in their parameters (vector to compare, start index, etc.).

 

According to the current API, we have to create a new RangeEqualVisitor object 
each time the comparison is performed. This leads to non-trivial performance 
overhead.

 

To address this problem, we make the RangeEqualVisitor reusable, and allow the 
client to change parameters of an existing visitor. 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6335) [Java] Improve the performance of DictionaryHashTable

2019-08-23 Thread Liya Fan (Jira)
Liya Fan created ARROW-6335:
---

 Summary: [Java] Improve the performance of DictionaryHashTable
 Key: ARROW-6335
 URL: https://issues.apache.org/jira/browse/ARROW-6335
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When comparing two entries in the dictionary hash table, it is more efficient 
to compare the indices directly, rather than using Objects.equals, because they 
are both ints.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6334) [Java] Improve the dictionary builder API to return the position of the value in the dictionary

2019-08-23 Thread Liya Fan (Jira)
Liya Fan created ARROW-6334:
---

 Summary: [Java] Improve the dictionary builder API to return the 
position of the value in the dictionary
 Key: ARROW-6334
 URL: https://issues.apache.org/jira/browse/ARROW-6334
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is an improvement of the addValue method.

Previously, the method returned a boolean, indicating if the value had been 
successfully added to the dictionary.

After the change, the method returns an integer, which is the position of the 
value in the dictionary.

The purpose of this change:
1. The dictionary position contains more information, compared with a boolean 
indicating if the value is added successfully.
2. This information about the index in the dictionary can be useful, for 
example, to collect statistics about the dictionary.

With the dictionary position, whether a value has been added can be easily 
determined.
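
A minimal sketch of an addValue method that returns the position is shown 
below. It stores String values in standard Java collections instead of Arrow 
vectors; the class name and the size method are hypothetical, and only the 
addValue name comes from the issue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DictionaryBuilderSketch {
    private final Map<String, Integer> positions = new HashMap<>();
    private final List<String> dictionary = new ArrayList<>();

    /** Adds the value if it is absent and returns its position in the dictionary. */
    int addValue(String value) {
        Integer existing = positions.get(value);
        if (existing != null) {
            return existing;
        }
        int position = dictionary.size();
        dictionary.add(value);
        positions.put(value, position);
        return position;
    }

    int size() {
        return dictionary.size();
    }
}
```

A caller can tell that a value was newly added by checking whether the 
returned position equals the dictionary size before the call.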



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6307) [Java] Provide RLE vector

2019-08-21 Thread Liya Fan (Jira)
Liya Fan created ARROW-6307:
---

 Summary: [Java] Provide RLE vector
 Key: ARROW-6307
 URL: https://issues.apache.org/jira/browse/ARROW-6307
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


RLE (run length encoding) is a widely used encoding/decoding technique. 
Compared with other encoding/decoding techniques, it is easier to work with the 
encoded data. 
 
We want to provide an RLE vector implementation in Arrow. The design details 
include:
 
1. RleVector implements ValueVector.
2. The data structure of RleVector includes an inner vector, plus a repetition 
buffer. 
3. We do not provide random access over the RleVector.
4. In the future, we will provide iterators to access the vector in sequence.
5. RleVector does not support update, but supports appending.
6. In the future, we will provide encoders/decoders to efficiently transform 
encoded/decoded vectors.
 



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6306) [Java] Support stable sort by stable comparators

2019-08-21 Thread Liya Fan (Jira)
Liya Fan created ARROW-6306:
---

 Summary: [Java] Support stable sort by stable comparators
 Key: ARROW-6306
 URL: https://issues.apache.org/jira/browse/ARROW-6306
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Stable sort is desirable in many scenarios. It means equal elements preserve 
their relative order after sorting.

There are stable sort algorithms. However, in practice, quick sort is often 
the fastest sort algorithm, and quick sort is not stable. 

To make the best of both worlds, we support stable sort by stable comparators. 
It differs from an ordinary comparator in that it breaks ties by comparing the 
value indices.

With the stable comparator, the quick sort algorithm becomes a stable algorithm.
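
The tie-breaking rule can be sketched as follows for an index sort over an int 
array. Because ties are broken by the original index, the ordering is total 
and the result does not depend on whether the underlying sort algorithm is 
stable. The class and method names are hypothetical; Arrays.sort is used here 
only as a convenient sort routine.

```java
import java.util.Arrays;

final class StableIndexSort {
    /** Compares by value first, then by index so that equal values keep their order. */
    static int stableCompare(int[] values, int index1, int index2) {
        int result = Integer.compare(values[index1], values[index2]);
        return result != 0 ? result : Integer.compare(index1, index2);
    }

    /** Returns the indices of values in sorted order; values itself is not modified. */
    static Integer[] sortedIndices(int[] values) {
        Integer[] indices = new Integer[values.length];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = i;
        }
        Arrays.sort(indices, (x, y) -> stableCompare(values, x, y));
        return indices;
    }
}
```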



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6297) [Java] Compare ArrowBufPointers by unsigned integers

2019-08-20 Thread Liya Fan (Jira)
Liya Fan created ARROW-6297:
---

 Summary: [Java] Compare ArrowBufPointers by unsigned integers
 Key: ARROW-6297
 URL: https://issues.apache.org/jira/browse/ARROW-6297
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Currently, ArrowBufPointers compare by bytes in lexicographic order. Another 
way is to compare by unsigned integers (longs, ints, & bytes). 

The second way involves additional bit operations for each iteration. However, 
it can compare 8 bytes at a time. So it is overall faster:

 

Compare by unsigned integers:

ArrowBufPointerBenchmarks.compareBenchmark avgt 5 65.722 ± 0.381 ns/op

 

Compare byte-wise:
ArrowBufPointerBenchmarks.compareBenchmark avgt 5 681.372 ± 0.604 ns/op
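
A minimal sketch of lexicographic comparison by unsigned integers is shown 
below. It reads 8 bytes at a time in big-endian order, so that the first byte 
is the most significant and an unsigned long comparison gives the same result 
as a byte-wise lexicographic comparison. It works on byte arrays rather than 
ArrowBuf, the names are hypothetical, and the way the actual implementation 
handles byte order is not shown in this message.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class UnsignedBlockCompare {
    /** Lexicographic comparison of two byte arrays, treating bytes as unsigned. */
    static int compare(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        ByteBuffer x = ByteBuffer.wrap(a).order(ByteOrder.BIG_ENDIAN);
        ByteBuffer y = ByteBuffer.wrap(b).order(ByteOrder.BIG_ENDIAN);
        int i = 0;
        // Compare 8 bytes per iteration as unsigned longs.
        for (; i + 8 <= common; i += 8) {
            long left = x.getLong(i);
            long right = y.getLong(i);
            if (left != right) {
                return Long.compareUnsigned(left, right);
            }
        }
        // Compare the remaining bytes one by one as unsigned values.
        for (; i < common; i++) {
            int result = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (result != 0) {
                return result;
            }
        }
        // Equal on the common prefix: the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    }
}
```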



--
This message was sent by Atlassian Jira
(v8.3.2#803003)


[jira] [Created] (ARROW-6266) [Java] Resolve the ambiguous method overload in RangeEqualsVisitor

2019-08-16 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6266:
---

 Summary: [Java] Resolve the ambiguous method overload in 
RangeEqualsVisitor
 Key: ARROW-6266
 URL: https://issues.apache.org/jira/browse/ARROW-6266
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


In RangeEqualsVisitor, there are overloaded methods for both a super class and 
its sub class. This can lead to unexpected behavior.

For example, if we call RangeEqualsVisitor#visit(v), where v is a fixed width 
vector, the method actually called may be visit(ValueVector), which is 
unexpected.

In general, in the visitor pattern, it is not a good idea to support method 
overload for both super class and sub-class as parameters.
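
The pitfall comes from the fact that Java selects an overload at compile time 
from the static type of the argument, not from its runtime type. The sketch 
below demonstrates this with hypothetical stand-in types; ValueVector and 
FixedWidthVector here are local classes defined for the example, not the Arrow 
classes.

```java
final class OverloadDemo {
    /** Stand-ins for the vector hierarchy, defined only for this example. */
    interface ValueVector {}

    static class FixedWidthVector implements ValueVector {}

    static class Visitor {
        String visit(ValueVector vector) {
            return "visit(ValueVector)";
        }

        String visit(FixedWidthVector vector) {
            return "visit(FixedWidthVector)";
        }
    }

    /** The variable's static type is the super type, so the general overload is chosen. */
    static String callWithSuperTypeReference() {
        ValueVector vector = new FixedWidthVector();
        return new Visitor().visit(vector);
    }

    /** The variable's static type is the sub type, so the specific overload is chosen. */
    static String callWithSubTypeReference() {
        FixedWidthVector vector = new FixedWidthVector();
        return new Visitor().visit(vector);
    }
}
```

Both methods pass the same kind of object, yet they reach different overloads, 
which is the kind of surprise described above.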



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6264) [Java] There is no need to consider byte order in ArrowBufHasher

2019-08-15 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6264:
---

 Summary: [Java] There is no need to consider byte order in 
ArrowBufHasher
 Key: ARROW-6264
 URL: https://issues.apache.org/jira/browse/ARROW-6264
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


According to the discussion in 
https://github.com/apache/arrow/pull/5063#issuecomment-521276547,
Arrow has a mechanism to make sure the data is stored in little-endian, so 
there is no need to check byte order.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6247) [Java] Provide a common interface for float4 and float8 vectors

2019-08-14 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6247:
---

 Summary: [Java] Provide a common interface for float4 and float8 
vectors
 Key: ARROW-6247
 URL: https://issues.apache.org/jira/browse/ARROW-6247
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We want to provide an interface for floating point vectors (float4 & float8). 
This interface will make it convenient for many operations on a vector. With 
this interface, the client code will be greatly simplified, with many 
branches/switch removed.

 

The design is similar to BaseIntVector (the interface for all integer vectors). 
We provide 3 methods for setting & getting floating point values:

 setWithPossibleTruncate

 setSafeWithPossibleTruncate

 getValueAsDouble



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6245) [DISCUSS][Java] Provide an interface for numeric vectors

2019-08-14 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6245:
---

 Summary: [DISCUSS][Java] Provide an interface for numeric vectors
 Key: ARROW-6245
 URL: https://issues.apache.org/jira/browse/ARROW-6245
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We want to provide an interface for all vectors with numeric types (small int, 
float4, float8, etc). This interface will make it convenient for many 
operations on a vector, like average, sum, variance, etc. With this interface, 
the client code will be greatly simplified, with many branches/switch removed.

 

The design is similar to BaseIntVector (the interface for all integer vectors). 
We provide 3 methods for setting & getting numeric values:

 setWithPossibleRounding

 setSafeWithPossibleRounding

 getValueAsDouble



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6221) [Java] Improve the performance of RangeEqualVisitor for comparing variable-width vectors

2019-08-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6221:
---

 Summary: [Java] Improve the performance of RangeEqualVisitor for 
comparing variable-width vectors
 Key: ARROW-6221
 URL: https://issues.apache.org/jira/browse/ARROW-6221
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Two improvements:
1. Compare the whole range of the data buffer, instead of comparing individual 
elements.
2. If two elements are of different sizes, there is no need to compare them.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6212) [Java] Support vector rank operation

2019-08-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6212:
---

 Summary: [Java] Support vector rank operation
 Key: ARROW-6212
 URL: https://issues.apache.org/jira/browse/ARROW-6212
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Given an unsorted vector, we want to get the index of the ith smallest element 
in the vector. This function is supported by the rank operation. 

We provide an implementation that gets the index with the desired rank, without 
sorting the vector (the vector is left intact), and the implementation takes 
O(n) time, where n is the vector length.
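
One well-known way to find the element of a given rank without fully sorting 
is quickselect, sketched below on an array of indices so that the values are 
left intact. Quickselect runs in expected O(n) time (its worst case is 
quadratic); whether the proposed implementation uses this algorithm is not 
stated in the message, and the names are hypothetical.

```java
final class VectorRank {
    /**
     * Returns the index in values of the element with the given 0-based rank
     * (rank 0 is the smallest). Requires 0 <= rank < values.length.
     */
    static int indexOfRank(int[] values, int rank) {
        // Work on a permutation of indices; the values array is never modified.
        int[] indices = new int[values.length];
        for (int i = 0; i < indices.length; i++) {
            indices[i] = i;
        }
        int low = 0;
        int high = indices.length - 1;
        while (low < high) {
            int p = partition(values, indices, low, high);
            if (p == rank) {
                return indices[p];
            }
            if (p < rank) {
                low = p + 1;   // the wanted element is right of the pivot
            } else {
                high = p - 1;  // the wanted element is left of the pivot
            }
        }
        return indices[low];
    }

    /** Lomuto partition over indices, ordered by the values they point to. */
    private static int partition(int[] values, int[] indices, int low, int high) {
        int pivot = values[indices[high]];
        int i = low;
        for (int j = low; j < high; j++) {
            if (values[indices[j]] < pivot) {
                swap(indices, i, j);
                i++;
            }
        }
        swap(indices, i, high);
        return i;
    }

    private static void swap(int[] a, int i, int j) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}
```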



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6209) [Java] Extract set null method to the base class for fixed width vectors

2019-08-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6209:
---

 Summary: [Java] Extract set null method to the base class for 
fixed width vectors
 Key: ARROW-6209
 URL: https://issues.apache.org/jira/browse/ARROW-6209
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Currently, each fixed width vector has the setNull method. All these 
implementations are identical, so we move them to the base class. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6185) [Java] Provide hash table based dictionary builder

2019-08-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6185:
---

 Summary: [Java] Provide hash table based dictionary builder
 Key: ARROW-6185
 URL: https://issues.apache.org/jira/browse/ARROW-6185
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is related to ARROW-5862. We provide another type of dictionary builder based 
on a hash table. Compared with a search based dictionary encoder, a hash table 
based encoder processes each new element in O(1) time, but requires extra memory 
space.
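
The trade-off can be illustrated with a small sketch. It is not the Arrow implementation: it uses `String` values and `java.util.HashMap` for brevity, and the class and method names are hypothetical. Each lookup or insertion takes expected O(1) time, and the map is the extra memory.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Builds a dictionary of distinct values, assigning indices in order of first appearance. */
final class HashDictionaryBuilderSketch {
  // Extra memory: one map entry per distinct value.
  private final Map<String, Integer> indexByValue = new HashMap<>();
  private final List<String> dictionary = new ArrayList<>();

  /** Adds the value if it is new and returns its dictionary index. */
  int addValue(String value) {
    Integer existing = indexByValue.get(value);
    if (existing != null) {
      return existing;
    }
    int index = dictionary.size();
    dictionary.add(value);
    indexByValue.put(value, index);
    return index;
  }

  /** Returns the distinct values seen so far, in index order. */
  List<String> getDictionary() {
    return dictionary;
  }
}
```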



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6184) [Java] Provide hash table based dictionary encoder

2019-08-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6184:
---

 Summary: [Java] Provide hash table based dictionary encoder
 Key: ARROW-6184
 URL: https://issues.apache.org/jira/browse/ARROW-6184
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This is the second part of ARROW-5917. We provide a sort based encoder, as well 
as a hash table based encoder, to solve the problem with the current dictionary 
encoder. 

In particular, we solve the following problems with the current encoder:
 # There are repeated conversions between Java objects and bytes (e.g. 
vector.getObject(i)).
 # Unnecessary memory copy (the vector data must be copied to the hash table).
 # The hash table cannot be reused for encoding multiple vectors (other data 
structure & results cannot be reused either).
 # The output vector should not be created/managed by the encoder (just like in 
the out-of-place sorter)
 # The hash table requires that the hashCode & equals methods be implemented 
appropriately, but this is not guaranteed.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6172) [Java] Avoid creating value holders repeatedly when reading data from JDBC

2019-08-08 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6172:
---

 Summary: [Java] Avoid creating value holders repeatedly when 
reading data from JDBC
 Key: ARROW-6172
 URL: https://issues.apache.org/jira/browse/ARROW-6172
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When converting JDBC data to Arrow data, a value holder is created for every 
single value. The following code snippet gives an example:

NullableSmallIntHolder holder = new NullableSmallIntHolder();
holder.isSet = isNonNull ? 1 : 0;
if (isNonNull) {
  holder.value = (short) value;
}
smallIntVector.setSafe(rowCount, holder);
smallIntVector.setValueCount(rowCount + 1);

 

This is inefficient in terms of both memory usage and computation. 

For most types, we can improve the performance by directly setting the value.

For example, the benchmarks on IntVector show that a 20% performance 
improvement can be achieved by directly setting the int value:

 

Benchmark                         Mode  Cnt   Score   Error  Units
IntBenchmarks.setIntDirectly      avgt    5  15.397 ± 0.018  us/op
IntBenchmarks.setWithValueHolder  avgt    5  19.198 ± 0.789  us/op

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6156) [Java] Support compare semantics for ArrowBufPointer

2019-08-07 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6156:
---

 Summary: [Java] Support compare semantics for ArrowBufPointer
 Key: ARROW-6156
 URL: https://issues.apache.org/jira/browse/ARROW-6156
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Compare two arrow buffer pointers by their content in lexicographic order.

Null is treated as smaller than non-null, and a shorter buffer is treated as 
smaller than a longer one.
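
A minimal sketch of these semantics on plain byte arrays. Two points are assumptions rather than statements from this issue: bytes are compared as unsigned values, and the length rule only decides the result when one buffer is a prefix of the other. The class name is hypothetical.

```java
/** Lexicographic comparison of byte sequences, with null ordered first. */
final class BufferCompareSketch {

  static int compare(byte[] left, byte[] right) {
    // Null handling: null equals null and is smaller than any non-null buffer.
    if (left == null || right == null) {
      if (left == right) {
        return 0;
      }
      return left == null ? -1 : 1;
    }
    // Compare the common prefix byte by byte (unsigned comparison is an assumption).
    int common = Math.min(left.length, right.length);
    for (int i = 0; i < common; i++) {
      int cmp = Integer.compare(left[i] & 0xFF, right[i] & 0xFF);
      if (cmp != 0) {
        return cmp;
      }
    }
    // One buffer is a prefix of the other: the shorter one is smaller.
    return Integer.compare(left.length, right.length);
  }
}
```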



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6155) [Java] Extract a super interface for vectors whose elements reside in continuous memory segments

2019-08-06 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6155:
---

 Summary: [Java] Extract a super interface for vectors whose 
elements reside in continuous memory segments
 Key: ARROW-6155
 URL: https://issues.apache.org/jira/browse/ARROW-6155
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Liya Fan
Assignee: Liya Fan


Vectors whose data elements reside in continuous memory segments should 
implement a common super interface. This will avoid unnecessary code 
branches.

For now, such vectors include fixed-width vectors and variable-width vectors. 
In the future, there can be more vectors included.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6143) [Java] Unify the copyFrom and copyFromSafe methods for all vectors

2019-08-05 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6143:
---

 Summary: [Java] Unify the copyFrom and copyFromSafe methods for 
all vectors
 Key: ARROW-6143
 URL: https://issues.apache.org/jira/browse/ARROW-6143
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Some vectors have their own implementations of copyFrom and copyFromSafe 
methods. 

Since we have extracted the copyFrom and copyFromSafe methods to the base 
interface (see ARROW-6021), we want all vectors' implementations to override 
the methods from the super interface.

This will provide a unified way of copying data elements. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6117) [Java] Fix the set method of FixedSizeBinaryVector

2019-08-02 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6117:
---

 Summary: [Java] Fix the set method of FixedSizeBinaryVector
 Key: ARROW-6117
 URL: https://issues.apache.org/jira/browse/ARROW-6117
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For the set method, if the parameter is null, it should clear the validity bit. 
However, the current implementation throws a NullPointerException.
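
The intended behavior can be sketched with a toy vector. This is not the FixedSizeBinaryVector code: the class below is hypothetical, keeps its data in plain arrays, and omits bounds checks and reallocation.

```java
/** Toy fixed-size binary vector with a validity bitmap (one bit per element). */
final class FixedSizeBinarySketch {
  private final int byteWidth;
  private final byte[] data;
  private final byte[] validity;

  FixedSizeBinarySketch(int byteWidth, int capacity) {
    this.byteWidth = byteWidth;
    this.data = new byte[byteWidth * capacity];
    this.validity = new byte[(capacity + 7) / 8];
  }

  /** Stores the value, or marks the slot as null when the value is null. */
  void set(int index, byte[] value) {
    if (value == null) {
      // Clear the validity bit instead of dereferencing the null argument.
      validity[index >> 3] &= ~(1 << (index & 7));
      return;
    }
    if (value.length != byteWidth) {
      throw new IllegalArgumentException("expected " + byteWidth + " bytes");
    }
    System.arraycopy(value, 0, data, index * byteWidth, byteWidth);
    validity[index >> 3] |= 1 << (index & 7);
  }

  boolean isNull(int index) {
    return (validity[index >> 3] & (1 << (index & 7))) == 0;
  }
}
```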



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6080) [Java] Support search operation for BaseRepeatedValueVector

2019-07-31 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6080:
---

 Summary: [Java] Support search operation for 
BaseRepeatedValueVector
 Key: ARROW-6080
 URL: https://issues.apache.org/jira/browse/ARROW-6080
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan






--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6070) [Java] Avoid creating new schema before IPC sending

2019-07-30 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6070:
---

 Summary: [Java] Avoid creating new schema before IPC sending
 Key: ARROW-6070
 URL: https://issues.apache.org/jira/browse/ARROW-6070
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


If a dictionary is attached to a schema, it may need to be converted before IPC 
sending. When this is not the case (which is most likely in practice), there is 
no need to do the conversion and no need to create a new schema. 

We solve the above problem by quickly determining if conversion is required, 
and if not, we avoid creating a new schema and return the original one 
immediately.

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6056) [Java] Handle exceptions when flight service processes put requests

2019-07-27 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6056:
---

 Summary: [Java] Handle exceptions when flight service processes 
put requests
 Key: ARROW-6056
 URL: https://issues.apache.org/jira/browse/ARROW-6056
 Project: Apache Arrow
  Issue Type: Improvement
Reporter: Liya Fan


Currently, the exception is swallowed silently and only a log message is 
printed. This makes debugging and problem diagnosis difficult. We need to 
handle the exception explicitly. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6031) [Java] Support iterating a vector by ArrowBufPointer

2019-07-24 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6031:
---

 Summary: [Java] Support iterating a vector by ArrowBufPointer
 Key: ARROW-6031
 URL: https://issues.apache.org/jira/browse/ARROW-6031
 Project: Apache Arrow
  Issue Type: New Feature
Reporter: Liya Fan
Assignee: Liya Fan


Provide the functionality to traverse a vector (fixed-width vector & 
variable-width vector) by an iterator. This is convenient for scenarios where 
vector elements are accessed in sequence.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6030) [Java] Efficiently compute hash code for ArrowBufPointer

2019-07-24 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6030:
---

 Summary: [Java] Efficiently compute hash code for ArrowBufPointer
 Key: ARROW-6030
 URL: https://issues.apache.org/jira/browse/ARROW-6030
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


As ArrowBufHasher is introduced, we can compute the hash code of a continuous 
region within an ArrowBuf. 

We optimize the process so that the hash code is not recomputed unnecessarily. 
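
One common way to avoid recomputation is to cache the hash code and invalidate the cache whenever the pointed-to region changes. The sketch below only illustrates that idea; the class, its fields, and the simple multiplicative hash are hypothetical and not taken from ArrowBufPointer.

```java
/** Pointer to a region of a byte array that caches its hash code until the region changes. */
final class CachedHashPointerSketch {
  private byte[] buffer;
  private int offset;
  private int length;

  private int cachedHash;
  private boolean hashValid;

  /** Counts how often the hash was actually computed (for demonstration only). */
  int hashComputations;

  /** Points to a new region and invalidates the cached hash code. */
  void set(byte[] buffer, int offset, int length) {
    this.buffer = buffer;
    this.offset = offset;
    this.length = length;
    this.hashValid = false;
  }

  /** Returns the hash of the current region, computing it at most once per call to set. */
  int hash() {
    if (!hashValid) {
      int h = 0;
      for (int i = offset; i < offset + length; i++) {
        h = 31 * h + buffer[i];
      }
      cachedHash = h;
      hashValid = true;
      hashComputations++;
    }
    return cachedHash;
  }
}
```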



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6024) [Java] Provide more hash algorithms

2019-07-24 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6024:
---

 Summary: [Java] Provide more hash algorithms 
 Key: ARROW-6024
 URL: https://issues.apache.org/jira/browse/ARROW-6024
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Provide more hash algorithms to choose for different scenarios.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-6021) [Java] Extract copyFrom and copyFromSafe to ValueVector

2019-07-24 Thread Liya Fan (JIRA)
Liya Fan created ARROW-6021:
---

 Summary: [Java] Extract copyFrom and copyFromSafe to ValueVector
 Key: ARROW-6021
 URL: https://issues.apache.org/jira/browse/ARROW-6021
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Currently we have copyFrom and copyFromSafe methods in fixed-width and 
variable-width vectors. Extracting them to the common super interface will make 
it much more convenient to use them, and avoid unnecessary if-else statements.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5998) [Java] Open a document to track the API changes

2019-07-22 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5998:
---

 Summary: [Java] Open a document to track the API changes
 Key: ARROW-5998
 URL: https://issues.apache.org/jira/browse/ARROW-5998
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


We need a document to track the API behavior changes, so as not to forget about 
them for the next release.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5996) [Java] Avoid resource leak in flight service

2019-07-22 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5996:
---

 Summary: [Java] Avoid resource leak in flight service
 Key: ARROW-5996
 URL: https://issues.apache.org/jira/browse/ARROW-5996
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


 # In FlightService#doPutCustom, the flight stream must be closed, even if an 
exception is thrown during the call to responseObserver.onError.
 # An exception that occurs during the call to acceptPut should not be swallowed.
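
The two points can be sketched with a try/finally block. This is a generic illustration, not the FlightService code: `handlePut`, `ErrorObserver`, and the parameters are hypothetical, and rethrowing after notifying the observer is only one possible reading of "should not be swallowed".

```java
/** Illustrates closing a stream on every path while still surfacing failures. */
final class PutHandlerSketch {

  /** Hypothetical stand-in for a response observer. */
  interface ErrorObserver {
    void onError(Throwable t);
  }

  static void handlePut(AutoCloseable stream, Runnable acceptPut, ErrorObserver observer)
      throws Exception {
    try {
      acceptPut.run();
    } catch (RuntimeException e) {
      // Report the failure; this call may itself throw.
      observer.onError(e);
      // Propagate instead of swallowing the original exception.
      throw e;
    } finally {
      // Runs on success, on failure, and when onError throws.
      stream.close();
    }
  }
}
```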



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5973) [Java] Variable width vectors' get methods should return null when the underlying data is null

2019-07-17 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5973:
---

 Summary: [Java] Variable width vectors' get methods should return 
null when the underlying data is null
 Key: ARROW-5973
 URL: https://issues.apache.org/jira/browse/ARROW-5973
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


For variable-width vectors (VarCharVector and VarBinaryVector), when the 
validity bit is not set, it means the underlying data is null, so the get 
method should return null.

However, the current implementation throws an IllegalStateException when 
NULL_CHECKING_ENABLED is set, or returns an empty array when the flag is clear.

Maybe the purpose of this design is to be consistent with fixed-width vectors. 
However, the scenario is different: fixed-width vectors (e.g. IntVector) throw 
an IllegalStateException, simply because the primitive types are non-nullable.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5970) [Java] Provide pointer to Arrow buffer

2019-07-17 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5970:
---

 Summary: [Java] Provide pointer to Arrow buffer
 Key: ARROW-5970
 URL: https://issues.apache.org/jira/browse/ARROW-5970
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Introduce pointer to a memory region within an ArrowBuf.

This pointer will be used as the basis for calculating the hash code within a 
vector, and equality determination.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5920) [Java] Support sort & compare for all variable width vectors

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5920:
---

 Summary: [Java] Support sort & compare for all variable width 
vectors
 Key: ARROW-5920
 URL: https://issues.apache.org/jira/browse/ARROW-5920
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


All variable-width vectors can reuse the same comparator for sorting & searching.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5918) [Java] Revise the BaseIntVector interface

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5918:
---

 Summary: [Java] Revise the BaseIntVector interface
 Key: ARROW-5918
 URL: https://issues.apache.org/jira/browse/ARROW-5918
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan


1. The set method should not use long as its parameter type. It is rarely the 
case that there are more than 2^32 distinct values in a dictionary. If it really 
happens, maybe it means we should not have used a dictionary in the first place. 

2. In addition to the get method, there should also be a set method. 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5917) [Java] Redesign the dictionary encoder

2019-07-12 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5917:
---

 Summary: [Java] Redesign the dictionary encoder
 Key: ARROW-5917
 URL: https://issues.apache.org/jira/browse/ARROW-5917
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current dictionary encoder implementation 
(org.apache.arrow.vector.dictionary.DictionaryEncoder) has heavy performance 
overhead, which prevents it from being useful in practice:
 # There are repeated conversions between Java objects and bytes (e.g. 
vector.getObject(i)).
 # Unnecessary memory copy (the vector data must be copied to the hash table).
 # The hash table cannot be reused for encoding multiple vectors (other data 
structure & results cannot be reused either).
 # The output vector should not be created/managed by the encoder (just like in 
the out-of-place sorter)
 # The hash table requires that the hashCode & equals methods be implemented 
appropriately, but this is not guaranteed.

We plan to implement a new one in the algorithm module, and gradually deprecate 
the current one.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5911) [Java] Make ListVector and MapVector create reader lazily

2019-07-11 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5911:
---

 Summary: [Java] Make ListVector and MapVector create reader lazily
 Key: ARROW-5911
 URL: https://issues.apache.org/jira/browse/ARROW-5911
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation creates the reader eagerly, which may waste 
resources and time. This issue changes the behavior to create the reader lazily.

This is a follow-up issue for ARROW-5897.
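
Lazy creation can be sketched as follows. The class is a hypothetical illustration, not ListVector or MapVector code, and it is not thread-safe: the reader is built on the first call to `getReader` and reused afterwards.

```java
import java.util.function.Supplier;

/** Holds a reader that is created on first use rather than at construction time. */
final class LazyReaderSketch<R> {
  private final Supplier<R> readerFactory;
  private R reader;

  LazyReaderSketch(Supplier<R> readerFactory) {
    // Only the factory is stored here; no reader is created yet.
    this.readerFactory = readerFactory;
  }

  /** Creates the reader on the first call and returns the same instance afterwards. */
  R getReader() {
    if (reader == null) {
      reader = readerFactory.get();
    }
    return reader;
  }
}
```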



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (ARROW-5898) [Java] Provide functionality to efficiently compute hash code for arbitrary memory segment

2019-07-10 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5898:
---

 Summary: [Java] Provide functionality to efficiently compute hash 
code for arbitrary memory segment
 Key: ARROW-5898
 URL: https://issues.apache.org/jira/browse/ARROW-5898
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


This issue adds functionality to efficiently compute the hash code for a 
consecutive memory region. This functionality is important in practical 
scenarios because it helps:

* Avoid unnecessary memory copy.

* Avoid repeated conversions between Java objects & Arrow buffers. 

Since the algorithm for calculating the hash code has significant performance 
implications, we need to design an interface so that different algorithms can 
be easily introduced as plug-ins.
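
A pluggable design could look like the sketch below. The `RegionHasher` interface and both implementations are hypothetical and operate on a plain `byte[]` rather than an Arrow buffer. They hash a region in place, so no copy and no intermediate Java object is needed.

```java
/** Pluggable hashing of a region inside a byte array, without copying the region. */
final class RegionHashSketch {

  /** Hypothetical plug-in point: each algorithm implements this interface. */
  interface RegionHasher {
    int hashCode(byte[] buffer, int offset, int length);
  }

  /** Simple multiplicative hash, similar in spirit to String#hashCode. */
  static final RegionHasher SIMPLE = (buffer, offset, length) -> {
    int h = 0;
    for (int i = offset; i < offset + length; i++) {
      h = 31 * h + buffer[i];
    }
    return h;
  };

  /** 32-bit FNV-1a, as an example of swapping in a different algorithm. */
  static final RegionHasher FNV1A = (buffer, offset, length) -> {
    int h = 0x811C9DC5;      // FNV offset basis
    for (int i = offset; i < offset + length; i++) {
      h ^= buffer[i] & 0xFF;
      h *= 0x01000193;       // FNV prime
    }
    return h;
  };
}
```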



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5897) [Java] Remove duplicated logic in MapVector

2019-07-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5897:
---

 Summary: [Java] Remove duplicated logic in MapVector
 Key: ARROW-5897
 URL: https://issues.apache.org/jira/browse/ARROW-5897
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


The current implementation of MapVector contains much logic duplicated from the 
super class. We remove the duplication by:
 # Making the default data vector name configurable
 # Extracting a method for creating the reader



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5884) [Java] Fix the get method of StructVector

2019-07-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5884:
---

 Summary: [Java] Fix the get method of StructVector
 Key: ARROW-5884
 URL: https://issues.apache.org/jira/browse/ARROW-5884
 Project: Apache Arrow
  Issue Type: Bug
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


When the data at the specified location is null, there is no need to call the 
method from super to set the reader:

holder.isSet = isSet(index);
super.get(index, holder);



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5881) [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits

2019-07-09 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5881:
---

 Summary: [Java] Provide functionalities to efficiently determine 
if a validity buffer has completely 1 bits/0 bits
 Key: ARROW-5881
 URL: https://issues.apache.org/jira/browse/ARROW-5881
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


These utilities can be used to efficiently determine, for example, 
* If all values in a vector are null
* If a vector contains no null
* If a vector contains any valid element
* If a vector contains any invalid element
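
The four checks reduce to two scans of the validity bitmap. The sketch below works on a plain `byte[]` and examines one byte at a time for clarity; an efficient version would presumably compare whole words. All names are hypothetical, and bit i of the bitmap is assumed to live in byte i / 8 at bit position i % 8.

```java
/** Scans a validity bitmap to answer all-null / no-null style questions. */
final class ValidityScanSketch {

  /** Returns true if the first valueCount bits all equal `expected`. */
  static boolean allBitsEqual(byte[] validity, int valueCount, boolean expected) {
    int fullBytes = valueCount / 8;
    byte fullPattern = (byte) (expected ? 0xFF : 0x00);
    for (int i = 0; i < fullBytes; i++) {
      if (validity[i] != fullPattern) {
        return false;
      }
    }
    // Check the trailing bits, ignoring padding beyond valueCount.
    int remaining = valueCount % 8;
    if (remaining != 0) {
      int mask = (1 << remaining) - 1;
      int tail = validity[fullBytes] & mask;
      return tail == (expected ? mask : 0);
    }
    return true;
  }

  static boolean allNull(byte[] validity, int valueCount) {
    return allBitsEqual(validity, valueCount, false);
  }

  static boolean noNulls(byte[] validity, int valueCount) {
    return allBitsEqual(validity, valueCount, true);
  }

  static boolean anyValid(byte[] validity, int valueCount) {
    return !allNull(validity, valueCount);
  }

  static boolean anyNull(byte[] validity, int valueCount) {
    return !noNulls(validity, valueCount);
  }
}
```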



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5844) [Java] Support comparison & sort for more numeric types

2019-07-04 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5844:
---

 Summary: [Java] Support comparison & sort for more numeric types
 Key: ARROW-5844
 URL: https://issues.apache.org/jira/browse/ARROW-5844
 Project: Apache Arrow
  Issue Type: New Feature
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Currently, we only support comparison & sort for 32-bit integers. In this 
issue, we provide support for more numeric data types:
* byte
* short
* long
* float
* double



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (ARROW-5843) [Java] Improve the readability and performance of BitVectorHelper#getNullCount

2019-07-04 Thread Liya Fan (JIRA)
Liya Fan created ARROW-5843:
---

 Summary: [Java] Improve the readability and performance of 
BitVectorHelper#getNullCount
 Key: ARROW-5843
 URL: https://issues.apache.org/jira/browse/ARROW-5843
 Project: Apache Arrow
  Issue Type: Improvement
  Components: Java
Reporter: Liya Fan
Assignee: Liya Fan


Improve the performance and readability as follows:
1. Count the number of 1 bits by long or int, instead of by byte.
2. If the value count is a multiple of 8, there is no need to process 
the last byte separately. This makes the code clearer. 
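
A sketch of counting set bits one long at a time. It is not the BitVectorHelper code: it reads from a `java.nio.ByteBuffer` instead of an Arrow buffer, the names are hypothetical, and padding bits in the last byte are subtracted only when the value count is not a multiple of 8.

```java
import java.nio.ByteBuffer;

/** Counts nulls (0 bits) among the first valueCount bits of a validity bitmap. */
final class NullCountSketch {

  static int getNullCount(ByteBuffer validity, int valueCount) {
    int sizeInBytes = (valueCount + 7) / 8;
    int setBits = 0;
    int index = 0;
    // Count eight bytes at a time; byte order does not affect the bit count.
    while (index + 8 <= sizeInBytes) {
      setBits += Long.bitCount(validity.getLong(index));
      index += 8;
    }
    // Count the remaining bytes one at a time.
    while (index < sizeInBytes) {
      setBits += Integer.bitCount(validity.get(index) & 0xFF);
      index++;
    }
    // Only when valueCount is not a multiple of 8 can the last byte hold padding bits.
    int remainder = valueCount % 8;
    if (remainder != 0) {
      int lastByte = validity.get(sizeInBytes - 1) & 0xFF;
      setBits -= Integer.bitCount(lastByte >>> remainder);
    }
    return valueCount - setBits;
  }
}
```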



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

