[jira] [Created] (ARROW-6221) [Java] Improve the performance of RangeEqualVisitor for comparing variable-width vectors
Liya Fan created ARROW-6221: --- Summary: [Java] Improve the performance of RangeEqualVisitor for comparing variable-width vectors Key: ARROW-6221 URL: https://issues.apache.org/jira/browse/ARROW-6221 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Two improvements: 1. Compare the whole range of the data buffer, instead of comparing individual elements. 2. If two elements are of different sizes, there is no need to compare them. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
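The two improvements can be sketched on a simplified variable-width layout (a flat data array plus an offsets array, where element i occupies data[offsets[i] .. offsets[i+1])). This is an illustrative sketch under those assumptions, not the actual RangeEqualVisitor code; all names here are hypothetical:

```java
import java.util.Arrays;

public class RangeEqualsSketch {

    /**
     * Compares elements [start, start + length) of two variable-width
     * "vectors", each given as a data array plus an offsets array.
     */
    public static boolean rangeEquals(byte[] leftData, int[] leftOffsets,
                                      byte[] rightData, int[] rightOffsets,
                                      int start, int length) {
        // Improvement 2: if any pair of elements differs in size, the
        // ranges cannot be equal, so we never look at their bytes.
        for (int i = start; i < start + length; i++) {
            int leftLen = leftOffsets[i + 1] - leftOffsets[i];
            int rightLen = rightOffsets[i + 1] - rightOffsets[i];
            if (leftLen != rightLen) {
                return false;
            }
        }
        // Improvement 1: all element sizes match, so the concatenated
        // bytes line up and the whole range can be compared in one bulk
        // call instead of element by element.
        return Arrays.equals(leftData, leftOffsets[start], leftOffsets[start + length],
                             rightData, rightOffsets[start], rightOffsets[start + length]);
    }
}
```

The bulk comparison is valid because equal per-element lengths imply the two byte ranges have identical internal layout, even when their absolute start offsets differ.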
Re: Running an Arrow hackathon
Hi David, This is cool, thank you for doing it. My thoughts: > - Is there a label we already use for easy-to-start-with issues? I see > variations on newbie/easy-fix/beginner on JIRA, is there a preference > for one? I think beginner would be my preferred label. Note that I don't think it has been consistently applied. > - There would (hopefully) be an influx of PRs. We wouldn't expect any > sort of timeliness on reviews, but it could exacerbate the > Travis/AppVeyor capacity problem - should I encourage people to set up > personal Travis instances? A personal Travis instance could be helpful, but I think if there isn't a rush for reviews, they could be staggered so as not to have too much of an impact on development. On Mon, Aug 12, 2019 at 6:55 AM David Li wrote: > Hi all, > > We're thinking of hosting an internal open-source hackathon in > September. I wanted to make Apache Arrow one of the projects we work > on, so I wanted to give maintainers here a heads up, and clarify a few > things. > > I would be around to help set up environments and make sure that PRs > follow the expected format. I could also do first-pass reviews. We > would focus on Python/Java/Rust as those have the most interest > (though maybe we could snag a few Gophers). > > At this point I'm not sure how many participants we'll have - most > likely no more than 10 or so. > > - Is there a label we already use for easy-to-start-with issues? I see > variations on newbie/easy-fix/beginner on JIRA, is there a preference > for one? > - There would (hopefully) be an influx of PRs. We wouldn't expect any > sort of timeliness on reviews, but it could exacerbate the > Travis/AppVeyor capacity problem - should I encourage people to set up > personal Travis instances? > > Thanks, > David >
[jira] [Created] (ARROW-6220) [Java] Add API to avro adapter to limit number of rows returned at a time.
Micah Kornfield created ARROW-6220: -- Summary: [Java] Add API to avro adapter to limit number of rows returned at a time. Key: ARROW-6220 URL: https://issues.apache.org/jira/browse/ARROW-6220 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Micah Kornfield We can either let clients iterate or, ideally, provide an iterator interface. This is important for large Avro data and was also discussed as something readers/adapters should have.
[jira] [Created] (ARROW-6219) [Java] Add API for JDBC adapter that can convert less than the full result set at a time.
Micah Kornfield created ARROW-6219: -- Summary: [Java] Add API for JDBC adapter that can convert less than the full result set at a time. Key: ARROW-6219 URL: https://issues.apache.org/jira/browse/ARROW-6219 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Micah Kornfield We should make the number of rows per batch configurable and either let clients iterate or provide an iterator API. Otherwise, for large result sets we might run out of memory.
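A minimal sketch of the batched-conversion idea described here: an iterator that drains an underlying row source (a plain Iterator stands in for a JDBC ResultSet cursor) at most rowsPerBatch rows at a time, so memory use stays bounded regardless of result set size. The names are hypothetical, not the actual Arrow JDBC adapter API:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class BatchingIterator<T> implements Iterator<List<T>> {
    private final Iterator<T> rows;   // stands in for a JDBC ResultSet cursor
    private final int rowsPerBatch;

    public BatchingIterator(Iterator<T> rows, int rowsPerBatch) {
        this.rows = rows;
        this.rowsPerBatch = rowsPerBatch;
    }

    @Override
    public boolean hasNext() {
        return rows.hasNext();
    }

    /** Returns the next batch of at most rowsPerBatch rows. */
    @Override
    public List<T> next() {
        List<T> batch = new ArrayList<>(rowsPerBatch);
        while (rows.hasNext() && batch.size() < rowsPerBatch) {
            batch.add(rows.next());
        }
        return batch;
    }
}
```

In a real adapter each batch would be materialized as a VectorSchemaRoot rather than a List, but the control flow is the same.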
[jira] [Created] (ARROW-6218) [Java] Add UINT type test in integration to avoid potential overflow
Ji Liu created ARROW-6218: - Summary: [Java] Add UINT type test in integration to avoid potential overflow Key: ARROW-6218 URL: https://issues.apache.org/jira/browse/ARROW-6218 Project: Apache Arrow Issue Type: Test Components: Java Reporter: Ji Liu Assignee: Ji Liu As per the discussion in [https://github.com/apache/arrow/pull/5002]: for UINT types, when writing/reading JSON data in the integration tests, the data type is widened (i.e. Long -> BigInteger, Int -> Long) to avoid potential overflow. For UINT8, the write-side and read-side code looks like this:

{code:java}
case UINT8: generator.writeNumber(UInt8Vector.getNoOverflow(buffer, index)); break;
{code}

{code:java}
BigInteger value = parser.getBigIntegerValue(); buf.writeLong(value.longValue());
{code}

We should add a test to guard against overflow in the data transfer process.
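The need for widening is easiest to see at the boundary values: the all-ones bit pattern is -1 in Java's signed types but the maximum value when read as unsigned. A small JDK-only illustration of the Int -> Long and Long -> BigInteger widening mentioned above (hypothetical helper names, mimicking the effect of Arrow's getNoOverflow, not its code):

```java
import java.math.BigInteger;

public class UnsignedWideningSketch {

    /** Widens a 64-bit unsigned value (stored in a signed long) to a
     *  BigInteger, mirroring the Long -> BigInteger widening above. */
    public static BigInteger readUint64(long raw) {
        // Long.toUnsignedString treats the bit pattern as unsigned,
        // so 0xFFFF...FF becomes 2^64 - 1 rather than -1.
        return new BigInteger(Long.toUnsignedString(raw));
    }

    /** Widens a 32-bit unsigned value (stored in a signed int) to a
     *  long, mirroring the Int -> Long widening above. */
    public static long readUint32(int raw) {
        return Integer.toUnsignedLong(raw);
    }
}
```

Without the widening, writing such values through the signed type would round-trip the maximum unsigned value as a negative number, which is exactly the overflow the proposed test should catch.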
Re: [Discussion][Java] Redesign the dictionary encoder
Hi Liya Fan, Ji Liu has an open pull request [1] that refactors the existing implementation to address the re-use aspect. I think it can also be extended to fix the memory ownership problem you highlighted. More work would be needed to address the customizable hash. Could you two please work together to figure out how to reconcile the following differences: 1. The new implementations you referenced require access to an ArrowBufPointer, which precludes usage on complex types. The existing implementation works with complex types. 2. The existing implementation has a customized hash table that avoids the need for boxing/unboxing. If I remember correctly, this showed approximately a 3-5% performance improvement in encoding. In both cases, it would probably be nice to move to an off-heap solution. Also, for removing the old encoder implementation, could you provide more details? The current encoder is used in the Vector module in unit tests at least, and the new encoders are in the algorithm package. How do you plan on resolving the dependencies? [1] https://github.com/apache/arrow/pull/5055/files On Sun, Aug 11, 2019 at 1:18 AM Fan Liya wrote: > Dear all, > > Dictionary encoding is an important feature, so it should be implemented > with good performance. > The current Java dictionary encoder implementation is based on static > utility methods in org.apache.arrow.vector.dictionary.DictionaryEncoder, > which has heavy performance overhead, preventing it from being useful in > practice: > > 1. The hash table cannot be reused for encoding multiple vectors (other > data structures & results cannot be reused either). > 2. The output vector should not be created/managed by the encoder (just > like in the out-of-place sorter) > 3. Different scenarios require different algorithms to compute the hash > code to avoid conflicts in the hash table, but this is not supported. 
> > Although some problems can be overcome by refactoring the current > implementation, it is difficult to do so without significantly changing the > current API. > So we propose a new design [1][2] of the dictionary encoder, to make it more > performant in practice. > > We plan to implement the new dictionary encoders with stateful objects, so > many useful partial/intermediate results can be reused. The new encoders > support using different hash code algorithms in different scenarios to > achieve good performance. > > We plan to support the new encoders in the following steps: > > 1. implement the new dictionary encoders in the algorithm module [3][4] > 2. make the old dictionary encoder deprecated > 3. remove the old encoder implementations > > Please give your valuable comments. > > Best, > Liya Fan > > [1] https://issues.apache.org/jira/browse/ARROW-5917 > [2] https://issues.apache.org/jira/browse/ARROW-6184 > [3] https://github.com/apache/arrow/pull/4994 > [4] https://github.com/apache/arrow/pull/5058 >
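To make the "stateful objects" idea above concrete, here is a deliberately simplified sketch (not the code in the linked PRs): the dictionary hash table lives in the encoder instance, so it survives across multiple vectors instead of being rebuilt per call. A boxing-free table with a customizable hash function, as discussed in the thread, would replace the HashMap in a real implementation:

```java
import java.util.HashMap;
import java.util.Map;

public class StatefulDictionaryEncoder {
    // Encoder state: reused across encode() calls on multiple vectors.
    private final Map<String, Integer> dictionary = new HashMap<>();

    /** Encodes one value, assigning the next free id on first sight. */
    public int encode(String value) {
        return dictionary.computeIfAbsent(value, v -> dictionary.size());
    }

    /** Encodes a whole "vector" (a plain array stands in here). */
    public int[] encodeAll(String[] values) {
        int[] encoded = new int[values.length];
        for (int i = 0; i < values.length; i++) {
            encoded[i] = encode(values[i]);
        }
        return encoded;
    }

    public int dictionarySize() {
        return dictionary.size();
    }
}
```

Note that the caller, not the encoder, would own and allocate the output vector, matching point 2 of the proposal.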
[jira] [Created] (ARROW-6217) [Website] Remove needless _site/ directory
Sutou Kouhei created ARROW-6217: --- Summary: [Website] Remove needless _site/ directory Key: ARROW-6217 URL: https://issues.apache.org/jira/browse/ARROW-6217 Project: Apache Arrow Issue Type: Task Components: Website Reporter: Sutou Kouhei Assignee: Sutou Kouhei
[jira] [Created] (ARROW-6216) Allow user to select the ZSTD compression level
Martin Radev created ARROW-6216: --- Summary: Allow user to select the ZSTD compression level Key: ARROW-6216 URL: https://issues.apache.org/jira/browse/ARROW-6216 Project: Apache Arrow Issue Type: Improvement Components: C++ Reporter: Martin Radev The compression level selected in Arrow for ZSTD is 1, which is the minimum compression level for the compressor. This leads to very high compression speed at the sacrifice of compression ratio. The user should be allowed to select the compression level, as both speed and ratio are data specific. The proposed solution is to expose the knob via an environment variable such as ARROW_ZSTD_COMPRESSION_LEVEL. Example: export ARROW_ZSTD_COMPRESSION_LEVEL=10 ./my_parquet_app
[jira] [Created] (ARROW-6215) [Java] RangeEqualVisitor does not properly compare ZeroVector
Bryan Cutler created ARROW-6215: --- Summary: [Java] RangeEqualVisitor does not properly compare ZeroVector Key: ARROW-6215 URL: https://issues.apache.org/jira/browse/ARROW-6215 Project: Apache Arrow Issue Type: Bug Components: Java Reporter: Bryan Cutler Assignee: Bryan Cutler ZeroVector.accept and RangeEqualVisitor always return true, no matter what type of vector it is compared against.
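A minimal sketch of the bug class being reported: an equality visitor has to check the concrete type of the other vector before declaring the ranges equal, otherwise a ZeroVector compares equal to anything. The classes below are stand-ins, not the actual Arrow Java types:

```java
public class ZeroVectorCompareSketch {
    interface Vector {}
    static final class ZeroVector implements Vector {}
    static final class IntVector implements Vector {}

    /** Buggy: returns true regardless of the other vector's type. */
    static boolean buggyEquals(ZeroVector left, Vector right) {
        return true;
    }

    /** Fixed: a ZeroVector can only equal another ZeroVector. */
    static boolean fixedEquals(ZeroVector left, Vector right) {
        return right instanceof ZeroVector;
    }
}
```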
[jira] [Created] (ARROW-6214) Sanitizer errors triggered via R bindings
Jeroen created ARROW-6214: - Summary: Sanitizer errors triggered via R bindings Key: ARROW-6214 URL: https://issues.apache.org/jira/browse/ARROW-6214 Project: Apache Arrow Issue Type: Bug Components: C++, R Affects Versions: 0.14.1 Environment: Linux Reporter: Jeroen When we run the examples of the R package through the sanitizers, several errors show up. These could be related to the segfaults we saw on the macOS builder on CRAN. Steps to reproduce + example outputs at: https://gist.github.com/jeroen/111901c351a4089a9effa90691a1dd81
Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
hi Eric -- there have not been any patches yet related to it. I'm currently in the midst of some internal restructuring of the Parquet C++ library to address long-standing efficiency and memory use issues. It's my intention to spend time on the data frame project as one of my next focus areas, likely to be after Labor Day. - Wes On Mon, Aug 12, 2019 at 10:28 AM Eric Erhardt wrote: > > Hey Wes, > > I just wanted to check-in on this work. Have there been any updates to the > Arrow "data frame" project worth sharing? > > Thanks, > Eric > > -Original Message- > From: Wes McKinney > Sent: Tuesday, May 21, 2019 8:17 AM > To: dev@arrow.apache.org > Subject: Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ > libraries > > On Tue, May 21, 2019, 8:43 AM Antoine Pitrou wrote: > > > > > Le 21/05/2019 à 13:42, Wes McKinney a écrit : > > > hi Antoine, > > > > > > On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou > > wrote: > > >> > > >> > > >> Hi Wes, > > >> > > >> How does copy-on-write play together with memory-mapped data? It > > >> seems that, depending on whether the memory map has several > > >> concurrent users (a condition which may be timing-dependent), we > > >> will either persist changes on disk or make them ephemeral in > > >> memory. That doesn't sound very user-friendly, IMHO. > > > > > > With memory-mapping, any Buffer is sliced from the parent MemoryMap > > > [1] so mutating the data on disk using this interface wouldn't be > > > possible with the way that I've framed it. > > > > Hmm... I always forget that SliceBuffer returns a read-only view. > > > > The more important issue is that parent_ is non-null. The idea is that no > mutation is allowed if we reason that another Buffer object has access to the > address space of interest. I think this style of copy-on-write is a > reasonable compromise that prevents most kinds of defensive copying. > > > > Regards > > > > Antoine. > >
RE: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries
Hey Wes, I just wanted to check-in on this work. Have there been any updates to the Arrow "data frame" project worth sharing? Thanks, Eric -Original Message- From: Wes McKinney Sent: Tuesday, May 21, 2019 8:17 AM To: dev@arrow.apache.org Subject: Re: [DISCUSS] Developing a "data frame" subproject in the Arrow C++ libraries On Tue, May 21, 2019, 8:43 AM Antoine Pitrou wrote: > > Le 21/05/2019 à 13:42, Wes McKinney a écrit : > > hi Antoine, > > > > On Tue, May 21, 2019 at 5:48 AM Antoine Pitrou > wrote: > >> > >> > >> Hi Wes, > >> > >> How does copy-on-write play together with memory-mapped data? It > >> seems that, depending on whether the memory map has several > >> concurrent users (a condition which may be timing-dependent), we > >> will either persist changes on disk or make them ephemeral in > >> memory. That doesn't sound very user-friendly, IMHO. > > > > With memory-mapping, any Buffer is sliced from the parent MemoryMap > > [1] so mutating the data on disk using this interface wouldn't be > > possible with the way that I've framed it. > > Hmm... I always forget that SliceBuffer returns a read-only view. > The more important issue is that parent_ is non-null. The idea is that no mutation is allowed if we reason that another Buffer object has access to the address space of interest. I think this style of copy-on-write is a reasonable compromise that prevents most kinds of defensive copying. > Regards > > Antoine. >
[jira] [Created] (ARROW-6213) C++ tests fail for AVX512
Charles Coulombe created ARROW-6213: --- Summary: C++ tests fail for AVX512 Key: ARROW-6213 URL: https://issues.apache.org/jira/browse/ARROW-6213 Project: Apache Arrow Issue Type: Bug Components: C++ Affects Versions: 0.14.1 Environment: CentOS 7.6.1810, Intel Xeon Processor (Skylake, IBRS) avx512 Reporter: Charles Coulombe Attachments: arrow-0.14.1-c++-failed-tests-cmake-conf.txt, arrow-0.14.1-c++-failed-tests.txt When building the libraries for AVX512 with GCC 7.3.0, two C++ tests fail.

{code}
The following tests FAILED:
 28 - arrow-compute-compare-test (Failed)
 30 - arrow-compute-filter-test (Failed)
Errors while running CTest
{code}

while for AVX2 they pass.
Re: Proposal to move website source to arrow-site, add automatic builds
I started https://github.com/apache/arrow/pull/5015 for the removal last week; will finish that up today or tomorrow. Neal On Sun, Aug 11, 2019 at 8:23 AM Wes McKinney wrote: > > It looks like the git pruning is done. So we can remove the site/ > directory from the main repository at some point soon. > > On Thu, Aug 8, 2019 at 2:29 PM Neal Richardson > wrote: > > > > I need a committer to make a master branch on arrow-site so that I can > > PR to it. I thought it could be just an empty orphan branch but that > > proved not to work, so a committer will need to do the following: > > > > ``` > > git clone g...@github.com:$YOURGITHUB/arrow.git arrow-copy > > cd arrow-copy > > git filter-branch --prune-empty --subdirectory-filter site master > > vi .git/config > > # Change remote "origin"'s URL to be g...@github.com:arrow/arrow-site.git > > git push -f origin master > > ``` > > > > On Thu, Aug 8, 2019 at 12:07 PM Wes McKinney wrote: > > > > > > Yes, I think we have adequate lazy consensus. Can you spell out what > > > are the next steps? > > > > > > On Thu, Aug 8, 2019 at 2:01 PM Neal Richardson > > > wrote: > > > > > > > > Have we reached "lazy consensus" here? No further comments in the last > > > > three days. > > > > > > > > Thanks, > > > > Neal > > > > > > > > On Mon, Aug 5, 2019 at 1:46 PM Joris Van den Bossche > > > > wrote: > > > > > > > > > > This sounds as a good proposal to me (at least at the moment where we > > > > > have > > > > > separate docs and main site). > > > > > I agree that documentation should indeed stay with the code, as you > > > > > want to > > > > > update those together in PRs. But the website is something you can > > > > > typically update separately and also might want to update > > > > > independently > > > > > from code releases. And certainly if this proposal makes it easier to > > > > > work > > > > > on the site, all the better. > > > > > > > > > > Joris > > > > > > > > > > Op ma 5 aug. 
2019 20:30 schreef Wes McKinney : > > > > > > > > > > > Let's wait a little while to collect any additional opinions about > > > > > > this. > > > > > > > > > > > > There's pretty good evidence from other Apache projects that this > > > > > > isn't too bad of an idea > > > > > > > > > > > > Apache Calcite: https://github.com/apache/calcite-site > > > > > > Apache Kafka: https://github.com/apache/kafka-site > > > > > > Apache Spark: https://github.com/apache/spark-website > > > > > > > > > > > > The Apache projects I've seen where the same repository is used for > > > > > > $FOO.apache.org tend to be ones where the documentation _is_ the > > > > > > website. I think we would need to commission a significant web > > > > > > design > > > > > > overhaul to be able to make our documentation page adequate as the > > > > > > landing point for visitors to https://arrow.apache.org. > > > > > > > > > > > > On Sat, Aug 3, 2019 at 3:46 PM Neal Richardson > > > > > > wrote: > > > > > > > > > > > > > > Given the status quo, it would be difficult for this to make the > > > > > > > Arrow > > > > > > > website less maintained. In fact, arrow-site is currently missing > > > > > > > the > > > > > > > most recent two patches that modified the site directory in > > > > > > > apache/arrow. Having multiple manual deploy steps increases the > > > > > > > likelihood that the website stays stale. > > > > > > > > > > > > > > As someone who has been working on the arrow site lately, this > > > > > > > proposal makes it easier for me to make changes to the website > > > > > > > because > > > > > > > I can automatically deploy my changes to a test site, and that > > > > > > > lets > > > > > > > others in the community, who perhaps don't touch the website much, > > > > > > > verify that they're good. > > > > > > > > > > > > > > I agree that the documentation situation needs attention, but as I > > > > > > > said initially, that's orthogonal to this static site generation. 
> > > > > > > I'd > > > > > > > like to work on that next, and I think these changes will make it > > > > > > > easier to do. I would not propose moving doc generation out of > > > > > > > apache/arrow--that belongs with the code. > > > > > > > > > > > > > > Neal > > > > > > > > > > > > > > On Sat, Aug 3, 2019 at 9:49 AM Wes McKinney > > > > > > > wrote: > > > > > > > > > > > > > > > > I think that the project website and the project documentation > > > > > > > > are > > > > > > > > currently distinct entities. The current Jekyll website is > > > > > > > > independent > > > > > > > > from the Sphinx documentation project aside from a link to the > > > > > > > > documentation from the website. > > > > > > > > > > > > > > > > I am guessing that we would want to maintain some amount of > > > > > > > > separation > > > > > > > > between the main site at arrow.apache.org and the code / format > > > > > > > > documentation, at minimum because we may want to make > > > > > > > > documentation > > > > > > > > available for multiple
Running an Arrow hackathon
Hi all, We're thinking of hosting an internal open-source hackathon in September. I wanted to make Apache Arrow one of the projects we work on, so I wanted to give maintainers here a heads up, and clarify a few things. I would be around to help set up environments and make sure that PRs follow the expected format. I could also do first-pass reviews. We would focus on Python/Java/Rust as those have the most interest (though maybe we could snag a few Gophers). At this point I'm not sure how many participants we'll have - most likely no more than 10 or so. - Is there a label we already use for easy-to-start-with issues? I see variations on newbie/easy-fix/beginner on JIRA, is there a preference for one? - There would (hopefully) be an influx of PRs. We wouldn't expect any sort of timeliness on reviews, but it could exacerbate the Travis/AppVeyor capacity problem - should I encourage people to set up personal Travis instances? Thanks, David
Re: [Discuss][FlightRPC] Extensions to Flight: middleware and DoPut tickets
I've (finally) put up a draft implementation of middleware for Java: https://github.com/apache/arrow/pull/5068 Hopefully this helps clarify how the proposal works. Best, David On 7/25/19, David Li wrote: > Thanks for the feedback, Antoine. That would be a natural method to > have - then the server could deny uploads (as you mention) or note > that the stream already exists. I've updated the proposal to reflect > that, leaving more detailed semantics (e.g. append vs overwrite) > application-defined. > > Best, > David > > On 7/25/19, Antoine Pitrou wrote: >> >> Le 08/07/2019 à 16:33, David Li a écrit : >>> Hi all, >>> >>> I've put together two more proposals for Flight, motivated by projects >>> we've been working on. I'd appreciate any comments on the >>> design/reasoning; I'm already working on the implementation, alongside >>> some other improvements to Flight. >>> >>> The first is to modify the DoPut call to follow the same request >>> pattern as DoGet. This is a format change and would require a vote. >>> >>> https://docs.google.com/document/d/1hrwxNwPU1aOD_1ciRUOaGeUCyXYOmu6IxxCfY6Stj6w/edit?usp=sharing >> >> It seems it would be useful to introduce a GetPutInfo (or GetUploadInfo) >> so as to allow differential behaviour between getting and putting. >> >> (one trivial case would be to disallow uploading altogether :-))) >> >> Regards >> >> Antoine. >> >
[jira] [Created] (ARROW-6212) [Java] Support vector rank operation
Liya Fan created ARROW-6212: --- Summary: [Java] Support vector rank operation Key: ARROW-6212 URL: https://issues.apache.org/jira/browse/ARROW-6212 Project: Apache Arrow Issue Type: New Feature Components: Java Reporter: Liya Fan Assignee: Liya Fan Given an unsorted vector, we want to get the index of the ith smallest element in the vector. This function is supported by the rank operation. We provide an implementation that gets the index with the desired rank, without sorting the vector (the vector is left intact), and the implementation takes O(n) time, where n is the vector length.
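One way to get the described behavior (expected O(n) time, original vector left intact) is quickselect over a separate index array: only the indices are permuted, never the data. A hypothetical sketch with a plain int[] standing in for an Arrow vector; this is not necessarily the algorithm used in the actual patch:

```java
import java.util.Random;

public class RankSketch {

    /** Returns the index (in the untouched values array) of the
     *  element with the given 0-based rank (rank 0 = smallest). */
    public static int indexOfRank(int[] values, int rank) {
        int n = values.length;
        int[] idx = new int[n];
        for (int i = 0; i < n; i++) idx[i] = i; // permute indices, not data
        int lo = 0, hi = n - 1;
        Random rnd = new Random(42); // fixed seed: deterministic pivots
        while (lo < hi) {
            // Lomuto partition around a random pivot
            int p = lo + rnd.nextInt(hi - lo + 1);
            int pivot = values[idx[p]];
            swap(idx, p, hi);
            int store = lo;
            for (int i = lo; i < hi; i++) {
                if (values[idx[i]] < pivot) {
                    swap(idx, i, store++);
                }
            }
            swap(idx, store, hi);
            if (store == rank) {
                return idx[store];
            } else if (store < rank) {
                lo = store + 1;
            } else {
                hi = store - 1;
            }
        }
        return idx[lo];
    }

    private static void swap(int[] a, int i, int j) {
        int t = a[i]; a[i] = a[j]; a[j] = t;
    }
}
```

Each partition discards a fraction of the candidates, giving the expected linear running time, and because only idx is shuffled the input vector is never modified.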
Re: [ANNOUNCE] New Arrow PMC member: Micah Kornfield
Congrats Micah, very well deserved !! On Mon, Aug 12, 2019 at 8:35 AM Micah Kornfield wrote: > Thanks everyone for the good wishes! > > On Fri, Aug 9, 2019 at 5:41 PM Fan Liya wrote: > > > Big congratulations! Micah > > Thank you so much for all the help! > > > > Best, > > Liya Fan > > > > On Saturday, August 10, 2019, Brian Hulette wrote: > > > Congratulations Micah! Well deserved :) > > > > > > On Fri, Aug 9, 2019 at 9:02 AM Francois Saint-Jacques < > > > fsaintjacq...@gmail.com> wrote: > > > > > >> Congrats! > > >> > > >> well deserved. > > >> > > >> On Fri, Aug 9, 2019 at 11:12 AM Wes McKinney > > wrote: > > >> > > > >> > The Project Management Committee (PMC) for Apache Arrow has invited > > >> > Micah Kornfield to become a PMC member and we are pleased to > announce > > >> > that Micah has accepted. > > >> > > > >> > Congratulations and welcome! > > >> > > > > > >
[jira] [Created] (ARROW-6211) [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface
Pindikura Ravindra created ARROW-6211: - Summary: [Java] Remove dependency on RangeEqualsVisitor from ValueVector interface Key: ARROW-6211 URL: https://issues.apache.org/jira/browse/ARROW-6211 Project: Apache Arrow Issue Type: Bug Reporter: Pindikura Ravindra This is a follow-up from [https://github.com/apache/arrow/pull/4933] public interface VectorVisitor {..} In ValueVector: public OUT accept(VectorVisitor visitor, IN value) throws EX;
[jira] [Created] (ARROW-6210) [Java] remove equals API from ValueVector
Pindikura Ravindra created ARROW-6210: - Summary: [Java] remove equals API from ValueVector Key: ARROW-6210 URL: https://issues.apache.org/jira/browse/ARROW-6210 Project: Apache Arrow Issue Type: Bug Reporter: Pindikura Ravindra This is a follow-up from [https://github.com/apache/arrow/pull/4933] The callers should be fixed to use the RangeEquals API instead.
[jira] [Created] (ARROW-6209) [Java] Extract set null method to the base class for fixed width vectors
Liya Fan created ARROW-6209: --- Summary: [Java] Extract set null method to the base class for fixed width vectors Key: ARROW-6209 URL: https://issues.apache.org/jira/browse/ARROW-6209 Project: Apache Arrow Issue Type: Improvement Components: Java Reporter: Liya Fan Assignee: Liya Fan Currently, each fixed width vector has the setNull method. All these implementations are identical, so we move them to the base class.
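The shared implementation amounts to clearing one bit in the validity bitmap (1 bit per element, 0 = null). A simplified sketch of the refactoring, with a plain byte[] standing in for an Arrow validity buffer and illustrative class names:

```java
public class SetNullSketch {

    /** Base class holding the single shared setNull implementation. */
    public abstract static class BaseFixedWidthVector {
        protected final byte[] validityBuffer;

        protected BaseFixedWidthVector(int valueCount) {
            this.validityBuffer = new byte[(valueCount + 7) / 8];
        }

        /** Shared by all fixed-width subclasses: clear the validity bit. */
        public void setNull(int index) {
            validityBuffer[index >> 3] &= ~(1 << (index & 7));
        }

        public void setValid(int index) {
            validityBuffer[index >> 3] |= (1 << (index & 7));
        }

        public boolean isNull(int index) {
            return (validityBuffer[index >> 3] & (1 << (index & 7))) == 0;
        }
    }

    /** Example subclass: no longer needs its own setNull. */
    public static class IntVector extends BaseFixedWidthVector {
        public IntVector(int valueCount) {
            super(valueCount);
        }
    }
}
```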
[jira] [Created] (ARROW-6208) [Java] Correct byte order before comparing in ByteFunctionHelpers
Prudhvi Porandla created ARROW-6208: --- Summary: [Java] Correct byte order before comparing in ByteFunctionHelpers Key: ARROW-6208 URL: https://issues.apache.org/jira/browse/ARROW-6208 Project: Apache Arrow Issue Type: Bug Components: Java Affects Versions: 1.0.0 Reporter: Prudhvi Porandla Assignee: Prudhvi Porandla Fix For: 1.0.0