Hi Julien and Jacques,

I saw that there is another thread on ByteBuffer-based reading at https://issues.apache.org/jira/browse/PARQUET-77, which is similar to my commits. It also introduces a CompatibilityUtil.java in Parquet. I think we can combine the code there with my pull request on GitHub.

Best,
Sunyu

On Fri, Aug 15, 2014 at 11:21 PM, sunyu duan <[email protected]> wrote:

> Thank you! I've updated the pull request to enable the enforcer-plugin and
> added some compatibility interfaces so that it won't break the
> compatibility tests. I think the pull request is now ready to merge.
> I'm waiting for comments on the code in case it still needs improvement
> before being merged. I'd be really happy if you could try it out on a
> real cluster.
>
>
> On Thu, Aug 14, 2014 at 11:56 PM, Jacques Nadeau <[email protected]> wrote:
>
>> Hi Sunyu,
>>
>> Nice work! We've been working with your patch and enhancing it for
>> incorporation into Apache Drill. What do you think the timeline and
>> steps are to get this into master? We'd be more than happy to help
>> depending on your time for this in the coming weeks.
>>
>> thanks,
>> Jacques
>>
>>
>> On Thu, Aug 14, 2014 at 8:30 AM, sunyu duan <[email protected]> wrote:
>>
>> > Hi everyone,
>> >
>> > My name is Sunyu Duan, and I am this year's GSoC student working on
>> > Parquet.
>> >
>> > As most of the work has been done, I wrote this report to summarize
>> > what I have done and the results.
>> >
>> > My project is "Using Zero-Copy read path in new Hadoop API". The goal
>> > is to exploit the Zero-Copy API introduced by Hadoop to improve the
>> > read performance of Parquet tasks running locally. My contribution is
>> > to replace the byte-array-based API with a ByteBuffer-based API in the
>> > read path, which avoids byte array copies while staying compatible
>> > with the old APIs. Here is the complete pull request:
>> > https://github.com/apache/incubator-parquet-mr/pull/6
>> >
>> > My work includes two parts.
>> >
>> > 1. Make the whole read path use ByteBuffer directly.
>> >
>> >    - Introduce an initFromPage interface in ValueReader and implement
>> >      it in each ValueReader.
>> >    - Introduce a ByteBufferInputStream.
>> >    - Introduce a ByteBufferBytesInput.
>> >    - Replace the unpack8values method with a ByteBuffer version.
>> >    - Use the newly introduced ByteBuffer-based methods in the read path.
>> >
>> > 2. Introduce a compatibility layer to stay compatible with the old
>> >    Hadoop API.
>> >
>> >    - Introduce a CompatibilityUtil.
>> >    - Use the CompatibilityUtil to perform the read action.
>> >
>> > After coding, I started to benchmark the improvement. After discussion
>> > with my mentor, I modified the TestInputOutputFormat test to inherit
>> > from ClusterMapReduceTestCase, which starts a MiniCluster for the unit
>> > test. In the unit test, I enabled caching and short-circuit reads. I
>> > created a 500MB and a 1GB log file on my dev box for the test. The test
>> > reads in the log file and writes it to a temporary Parquet file using
>> > MapReduce. Then it reads from the temporary Parquet file and writes to
>> > an output file. I inserted a time counter on the latter MapReduce task
>> > and used the time spent on the second MapReduce job as the indicator.
>> > I ran the unit test with and without the Zero-Copy API enabled on the
>> > 500MB and 1GB log files and compared the time spent in each case. The
>> > results are shown below.
>> >
>> >                          File Size   Average Reading Time (s)   Improvement
>> > Without Zero-Copy API    500MB       576
>> > Zero-Copy API            500MB       394                        46%
>> > Without Zero-Copy API    1024MB      1080
>> > Zero-Copy API            1024MB      781                        38%
>> >
>> > As we can see, there is about a 30~50% improvement in read performance,
>> > which shows the project has reached its goal. But the benchmark is
>> > insufficient. My dev box has very limited resources, and a 1GB file is
>> > the largest file I can put on it. After GSoC, it would be better to
>> > invite more people to try it out on a real cluster with larger files
>> > to benchmark its effect in a real situation.
>> >
>> >
>> > Best,
>> >
>> > Sunyu
>> >
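
For readers who want a concrete picture of the "Introduce a ByteBufferInputStream" item in the report above, here is a minimal Java sketch. It is not the code from the pull request: the class body, the `slice` helper, and its signature are assumptions about what such a stream might look like, using only standard `java.nio` calls.

```java
import java.io.InputStream;
import java.nio.ByteBuffer;

/**
 * Hypothetical sketch (not the pull request's implementation):
 * an InputStream view over a ByteBuffer that does not copy the backing data.
 */
class ByteBufferInputStream extends InputStream {
    private final ByteBuffer buf;

    ByteBufferInputStream(ByteBuffer source) {
        // duplicate() shares the content but has its own position and limit,
        // so reading from this stream leaves the caller's buffer untouched.
        this.buf = source.duplicate();
    }

    @Override
    public int read() {
        // Mask so the result is an unsigned value in 0..255, or -1 at end of stream.
        return buf.hasRemaining() ? (buf.get() & 0xFF) : -1;
    }

    @Override
    public int read(byte[] dst, int off, int len) {
        if (len == 0) {
            return 0;
        }
        if (!buf.hasRemaining()) {
            return -1;
        }
        int n = Math.min(len, buf.remaining());
        buf.get(dst, off, n); // this path copies, kept for byte[] based callers
        return n;
    }

    @Override
    public int available() {
        return buf.remaining();
    }

    /**
     * Assumed helper: returns a view of the next len bytes without copying
     * and advances this stream past them.
     */
    ByteBuffer slice(int len) {
        if (len < 0 || len > buf.remaining()) {
            throw new IllegalArgumentException("len out of range: " + len);
        }
        ByteBuffer view = buf.slice(); // starts at the current position
        view.limit(len);
        buf.position(buf.position() + len);
        return view;
    }
}
```

The point of the `slice` helper in this sketch is that a reader which understands ByteBuffer can take a view of the page bytes directly, while older callers can still go through the `byte[]` based `read` methods.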
