Hi Sunyu,

I logged the JIRA, and the patch I attached is based on your GSoC work. We would like to have it included in the same pull request, but there is some work that needs to be done on the patch I submitted. In particular, I think we need to solve the issue of backward compatibility for the direct decompressor implementation for the Snappy decompressor. I'm hoping for some feedback on the best way to address that.

Thanks,
Parth
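P.S. One possible shape for the compatibility handling, sketched only as an illustration (the class name and wiring below are made up, not the attached patch), is to probe at runtime for Hadoop's direct-decompression support and fall back to the existing stream-based path when it is missing:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

import org.apache.hadoop.io.compress.CompressionCodec;

// Illustrative sketch: use Hadoop's direct decompressor via reflection when
// it exists, otherwise fall back to the stream-based byte[] path so the same
// code still runs against Hadoop versions without direct support.
public final class SnappyDecompressCompat {

  private static final Method CREATE_DIRECT;
  private static final Method DIRECT_DECOMPRESS;

  static {
    Method create = null;
    Method decompress = null;
    try {
      Class<?> codecIface = Class.forName(
          "org.apache.hadoop.io.compress.DirectDecompressionCodec");
      Class<?> decompIface = Class.forName(
          "org.apache.hadoop.io.compress.DirectDecompressor");
      create = codecIface.getMethod("createDirectDecompressor");
      decompress = decompIface.getMethod(
          "decompress", ByteBuffer.class, ByteBuffer.class);
    } catch (ReflectiveOperationException e) {
      // older Hadoop: direct decompression not available, use the fallback
    }
    CREATE_DIRECT = create;
    DIRECT_DECOMPRESS = decompress;
  }

  public static ByteBuffer decompress(CompressionCodec codec,
                                      ByteBuffer compressed,
                                      int uncompressedSize) throws IOException {
    if (CREATE_DIRECT != null
        && CREATE_DIRECT.getDeclaringClass().isInstance(codec)
        && compressed.isDirect()) {  // direct decompressors need direct buffers
      try {
        Object direct = CREATE_DIRECT.invoke(codec);
        ByteBuffer out = ByteBuffer.allocateDirect(uncompressedSize);
        DIRECT_DECOMPRESS.invoke(direct, compressed, out);
        // the uncompressed size is known from the page header, so set the
        // readable window explicitly rather than relying on the decompressor
        out.position(0);
        out.limit(uncompressedSize);
        return out;
      } catch (ReflectiveOperationException e) {
        throw new IOException("direct decompression failed", e);
      }
    }
    // Fallback: the old byte[]-based path through the codec's input stream.
    byte[] in = new byte[compressed.remaining()];
    compressed.get(in);
    InputStream is = codec.createInputStream(new ByteArrayInputStream(in));
    byte[] out = new byte[uncompressedSize];
    int off = 0;
    while (off < uncompressedSize) {
      int n = is.read(out, off, uncompressedSize - off);
      if (n < 0) {
        break;
      }
      off += n;
    }
    is.close();
    return ByteBuffer.wrap(out, 0, off);
  }
}

Whatever form it takes in the patch, the point is that the Snappy read path should keep working on Hadoop versions that do not expose a direct decompressor. Further sketches of the ideas in Sunyu's report are appended after the quoted thread below.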
On Tue, Aug 26, 2014 at 6:52 PM, sunyu duan <[email protected]> wrote:

> Hi Julien and Jacques,
>
> I saw there seems to be another thread on ByteBuffer-based reading at
> https://issues.apache.org/jira/browse/PARQUET-77 which is similar to my
> commits, and it also introduces a CompatibilityUtil.java in Parquet. I
> think we can combine the code there with my pull request on GitHub.
>
> Best,
> Sunyu
>
>
> On Fri, Aug 15, 2014 at 11:21 PM, sunyu duan <[email protected]> wrote:
>
> > Thank you! I've updated the pull request to enable the enforcer plugin
> > and added some compatibility interfaces so that it won't break the
> > compatibility tests. Now I think the pull request is ready to merge.
> > I'm waiting for comments on the code in case it still needs some
> > improvement before being merged. And I'd be really happy if you could
> > try it out on a real cluster.
> >
> >
> > On Thu, Aug 14, 2014 at 11:56 PM, Jacques Nadeau <[email protected]>
> > wrote:
> >
> >> Hi Sunyu,
> >>
> >> Nice work! We've been working with your patch and enhancing it for
> >> incorporation into Apache Drill. What do you think the timeline and
> >> steps are to get this into master? We'd be more than happy to help,
> >> depending on your time for this in the coming weeks.
> >>
> >> thanks,
> >> Jacques
> >>
> >>
> >> On Thu, Aug 14, 2014 at 8:30 AM, sunyu duan <[email protected]> wrote:
> >>
> >> > Hi everyone,
> >> >
> >> > My name is Sunyu Duan, and I am this year's GSoC student working on
> >> > Parquet.
> >> >
> >> > As most of the work has been done, I wrote this report to summarize
> >> > what I have done and the results.
> >> >
> >> > My project is "Using the Zero-Copy read path in the new Hadoop API".
> >> > The goal is to exploit the Zero-Copy API introduced by Hadoop to
> >> > improve the read performance of Parquet tasks running locally. My
> >> > contribution is to replace the byte-array-based API with a
> >> > ByteBuffer-based API in the read path, avoiding byte-array copies
> >> > while staying compatible with the old APIs. Here is the complete
> >> > pull request: https://github.com/apache/incubator-parquet-mr/pull/6
> >> >
> >> > My work includes two parts.
> >> >
> >> > 1. Make the whole read path use ByteBuffer directly.
> >> >
> >> >    - Introduce an initFromPage interface in ValueReader and
> >> >      implement it in each ValueReader.
> >> >    - Introduce a ByteBufferInputStream.
> >> >    - Introduce a ByteBufferBytesInput.
> >> >    - Replace the unpack8values method with a ByteBuffer version.
> >> >    - Use the introduced ByteBuffer-based methods in the read path.
> >> >
> >> > 2. Introduce a compatibility layer to stay compatible with the old
> >> >    Hadoop API.
> >> >
> >> >    - Introduce a CompatibilityUtil.
> >> >    - Use the CompatibilityUtil to perform the read actions.
> >> >
> >> > After coding, I started to benchmark the improvement. After
> >> > discussing with my mentor, I modified the TestInputOutputFormat test
> >> > to inherit from ClusterMapReduceTestCase, which starts a MiniCluster
> >> > for the unit test. In the unit test, I enabled caching and
> >> > short-circuit reads. I created a 500MB and a 1GB log file on my dev
> >> > box for the test. The test reads in the log file and writes it to a
> >> > temporary Parquet-format file using MapReduce. Then it reads from
> >> > the temporary Parquet-format file and writes to an output file. I
> >> > inserted a time counter into the latter MapReduce task and used the
> >> > time spent on the second MapReduce job as the indicator.
> >> > I ran the unit test with and without the Zero-Copy API enabled on
> >> > the 500MB and 1GB log files and compared the time spent in each
> >> > case. The results are shown below.
> >> >
> >> >                            File Size   Average Reading Time (s)   Improvement
> >> >   Without Zero-Copy API    500MB       576
> >> >   Zero-Copy API            500MB       394                         46%
> >> >   Without Zero-Copy API    1024MB      1080
> >> >   Zero-Copy API            1024MB      781                         38%
> >> >
> >> > As we can see, there is about a 38~46% improvement in reading
> >> > performance, which shows the project has reached its goal. But the
> >> > benchmark is insufficient: my dev box has very limited resources,
> >> > and a 1GB file is the largest file I can put there. After GSoC, it
> >> > would be good to invite more people to try it out on a real cluster
> >> > with larger files to benchmark its effect in real situations.
> >> >
> >> > Best,
> >> >
> >> > Sunyu
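To make the report above more concrete, a few sketches follow; none of them is the actual code from the pull request. Part 1 mentions a ByteBufferInputStream. A minimal sketch of what such an adapter generally looks like, exposing a ByteBuffer through the InputStream interface so stream-based readers can consume a buffer without first copying it into a byte[] (the class in the pull request presumably carries more than this):

import java.io.InputStream;
import java.nio.ByteBuffer;

// Illustrative sketch of an InputStream view over a ByteBuffer.
public class ByteBufferBackedInputStream extends InputStream {

  private final ByteBuffer buf;

  public ByteBufferBackedInputStream(ByteBuffer buf) {
    this.buf = buf;
  }

  @Override
  public int read() {
    // single-byte read: return the next byte as an unsigned value, or -1 at end
    return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
  }

  @Override
  public int read(byte[] b, int off, int len) {
    if (len == 0) {
      return 0;
    }
    if (!buf.hasRemaining()) {
      return -1;
    }
    // bulk read straight out of the buffer, bounded by what remains
    int n = Math.min(len, buf.remaining());
    buf.get(b, off, n);
    return n;
  }

  @Override
  public int available() {
    return buf.remaining();
  }
}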
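The compatibility layer in part 2 comes down to choosing the read path at runtime. Again this is a sketch of the idea only, not the actual CompatibilityUtil from the pull request; it assumes newer Hadoop versions add a read(ByteBuffer) method to FSDataInputStream while older ones only offer byte[]-based reads:

import java.io.IOException;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative sketch of a ByteBuffer-read shim: call the newer
// read(ByteBuffer) method via reflection when it exists, otherwise fall back
// to the byte[]-based readFully and wrap the result, so callers always
// receive a ByteBuffer regardless of the Hadoop version on the classpath.
public final class ByteBufferReadCompat {

  private static final Method READ_BYTEBUFFER = lookup();

  private static Method lookup() {
    try {
      return FSDataInputStream.class.getMethod("read", ByteBuffer.class);
    } catch (NoSuchMethodException e) {
      return null;  // older Hadoop: no ByteBuffer read available
    }
  }

  public static ByteBuffer readFully(FSDataInputStream in, int length)
      throws IOException {
    if (READ_BYTEBUFFER != null) {
      ByteBuffer buf = ByteBuffer.allocate(length);
      try {
        while (buf.hasRemaining()) {
          int n = (Integer) READ_BYTEBUFFER.invoke(in, buf);
          if (n < 0) {
            throw new IOException("unexpected end of stream");
          }
        }
      } catch (IllegalAccessException | InvocationTargetException e) {
        throw new IOException("ByteBuffer read failed", e);
      }
      buf.flip();
      return buf;
    }
    // Old API: read into a byte[] and wrap it (one extra copy, but it keeps
    // the same ByteBuffer-based signature for callers).
    byte[] bytes = new byte[length];
    in.readFully(bytes);
    return ByteBuffer.wrap(bytes);
  }
}

The real zero-copy path presumably goes further (newer Hadoop also exposes a pooled read that can hand back an mmapped buffer), but the version-probing structure stays the same.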
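Finally, on the benchmark setup: enabling short-circuit reads for the test comes down to a couple of HDFS client settings along these lines. Examples only; the socket path must match the DataNode configuration on the test box, and the MiniCluster/ClusterMapReduceTestCase wiring and HDFS cache directives are not shown:

import org.apache.hadoop.conf.Configuration;

// Illustrative only: the kind of HDFS client settings a local-read benchmark
// would rely on. Values here are examples, not the actual test configuration.
public final class BenchmarkReadConf {
  public static Configuration create() {
    Configuration conf = new Configuration();
    // read local blocks directly instead of going through the DataNode
    conf.setBoolean("dfs.client.read.shortcircuit", true);
    // domain socket shared with the DataNode (example path)
    conf.set("dfs.domain.socket.path", "/var/lib/hadoop-hdfs/dn_socket");
    // skip checksum verification on short-circuit reads
    conf.setBoolean("dfs.client.read.shortcircuit.skip.checksum", true);
    return conf;
  }
}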
