It doesn't appear so, from a quick look over the code base. It seems that PageReader [1] or Page [2] would potentially need to be changed to track this information.
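
To illustrate the kind of tracking I have in mind, here is a minimal, self-contained sketch. It does not use the real Parquet classes: FakePage, PageSource, VectorPageSource and TrackingPageSource are all hypothetical names made up for this example. The only thing borrowed from the real code is the general shape of PageReader::NextPage(), which hands back one page at a time. The idea is simply a wrapper that counts pages as they are handed out, so a caller can ask for the current page ordinal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for parquet::Page; only the value count matters here.
struct FakePage {
  int32_t num_values;
};

// Hypothetical stand-in for the NextPage() part of parquet::PageReader.
class PageSource {
 public:
  virtual ~PageSource() = default;
  // Returns nullptr once the column chunk is exhausted.
  virtual std::shared_ptr<FakePage> NextPage() = 0;
};

// Serves pages from an in-memory list of page sizes, for demonstration only.
class VectorPageSource : public PageSource {
 public:
  explicit VectorPageSource(std::vector<int32_t> sizes)
      : sizes_(std::move(sizes)) {}

  std::shared_ptr<FakePage> NextPage() override {
    if (next_ >= sizes_.size()) return nullptr;
    return std::make_shared<FakePage>(FakePage{sizes_[next_++]});
  }

 private:
  std::vector<int32_t> sizes_;
  std::size_t next_ = 0;
};

// Decorator that remembers the ordinal of the most recently returned page.
class TrackingPageSource : public PageSource {
 public:
  explicit TrackingPageSource(std::unique_ptr<PageSource> inner)
      : inner_(std::move(inner)) {
    assert(inner_ != nullptr);
  }

  std::shared_ptr<FakePage> NextPage() override {
    auto page = inner_->NextPage();
    // Only advance the counter when a page was actually produced.
    if (page != nullptr) ++current_page_;
    return page;
  }

  // -1 before the first page has been read, otherwise a zero-based ordinal.
  int64_t current_page() const { return current_page_; }

 private:
  std::unique_ptr<PageSource> inner_;
  int64_t current_page_ = -1;
};
```

In the real reader the counter would presumably have to skip dictionary pages and be surfaced through the column reader that ReadBatch lives on; both of those are assumptions on my part rather than something the current code does.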

[1] https://github.com/apache/arrow/blob/fe142922f6a4b801ece8fd16a1bff9836a8aaf77/cpp/src/parquet/column_reader.h#L99
[2] https://github.com/apache/arrow/blob/809d40ab9518bd254705f35af01162a9da588516/cpp/src/parquet/column_page.h

On Fri, Jun 19, 2020 at 3:42 PM Lekshmi Narayanan, Arun Balajiee <[email protected]> wrote:

> Hi
>
> I think I should reframe my question. I am working on PARQUET-1404. I am not looking to extend or change this API. I want to understand ReadBatch, and my question was: when I make calls to read values from the Parquet file internally, row by row, is there a way to know which page I am at?
>
> Regards
> Arun Balajiee
>
> ________________________________
> From: Micah Kornfield <[email protected]>
> Sent: Thursday, June 18, 2020 11:21:42 PM
> To: [email protected] <[email protected]>
> Subject: Re: Arrow 1404: Adding index for Page-level Skipping
>
> Is this internally in the class or adding a parameter in the API? What is the use case?
>
> On Saturday, June 13, 2020, Lekshmi Narayanan, Arun Balajiee <[email protected]> wrote:
>
> > Hi Dev
> >
> > Thanks Wes for these comments.
> >
> > As informed in other threads, I have completed most of it. Will try to structure it according to the comments.
> >
> > I had one question regarding a (un)related matter. Whenever we make calls to
> >
> > ReadBatch(int64_t batch_size, int16_t* def_levels, int16_t* rep_levels, T* values, int64_t* values_read)
> >
> > is there a possibility to keep track of which page we are at to retrieve values?
> >
> > Regards
> > Arun Balajiee
> > ________________________________
> > From: Wes McKinney <[email protected]>
> > Sent: 02 April 2020 13:16
> > To: Parquet Dev <[email protected]>
> > Cc: Deepak Majeti <[email protected]>; Anatoli Shein <[email protected]>
> > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> >
> > I just left comments on the PR.
> > The new APIs (their semantics and what should be passed as arguments) are still not adequately documented (in other words, I wouldn't know how to use them just from reading the header file), so I think we should focus on that for the moment. In fairness, documentation for other functions in these headers is poor, but they also have the semantics of "read all data in the file from start to finish". These new APIs appear to do something different, so we need to write that down in detail in Doxygen-style comments.
> >
> > On Thu, Apr 2, 2020 at 2:23 AM Lekshmi Narayanan, Arun Balajiee <[email protected]> wrote:
> > >
> > > Hi
> > > Would my pull request be useful for the discussion from here?
> > > https://github.com/apache/arrow/pull/6807
> > >
> > > Regards,
> > > Arun Balajiee
> > >
> > > From: Wes McKinney <[email protected]>
> > > Sent: Tuesday, February 18, 2020 3:34 AM
> > > To: Parquet Dev <[email protected]>
> > > Cc: Deepak Majeti <[email protected]>; Anatoli Shein <[email protected]>
> > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > >
> > > That's helpful, but I think it would be a good idea to have enough information in the header files to determine what the new APIs do without reading example code.
> > >
> > > On Mon, Feb 17, 2020 at 10:59 AM Lekshmi Narayanan, Arun Balajiee <[email protected]> wrote:
> > > >
> > > > I also made changes in the low-level-api folder, which I think that link couldn't capture:
> > > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > > >
> > > > Regards,
> > > > Arun Balajiee
> > > >
> > > > ________________________________
> > > > From: Wes McKinney <[email protected]>
> > > > Sent: Monday, February 17, 2020 8:11:09 AM
> > > > To: Parquet Dev <[email protected]>
> > > > Cc: Deepak Majeti <[email protected]>; Anatoli Shein <[email protected]>
> > > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > > >
> > > > hi Arun,
> > > >
> > > > By "public APIs" I was referring to changes in the public header files. I see there are some changes to parquet/file_reader.h and metadata.h
> > > >
> > > > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> > > >
> > > > Can you add some Doxygen comments to the new APIs that explain how these APIs are to be used (and what the parameters mean)?
> > > > The hope would be that a user could make use of the column index functionality by reading the .h files only.
> > > >
> > > > Thanks
> > > > Wes
> > > >
> > > > On Fri, Feb 14, 2020 at 2:57 PM Lekshmi Narayanan, Arun Balajiee <[email protected]> wrote:
> > > > >
> > > > > Hi
> > > > > I have made my changes for the API here. Does it look good, and is this what you were seeking from me? The writer API is still in the works, and I need to make the reader more generic to support all class data types.
> > > > >
> > > > > https://github.com/a2un/arrow/blob/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp/examples/parquet/low-level-api/reader-writer-with-index.cc
> > > > >
> > > > > Regards,
> > > > > Arun Balajiee
> > > > >
> > > > > From: Wes McKinney <[email protected]>
> > > > > Sent: Tuesday, February 4, 2020 11:24 PM
> > > > > To: Parquet Dev <[email protected]>
> > > > > Cc: Deepak Majeti <[email protected]>; Anatoli Shein <[email protected]>
> > > > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > > > >
> > > > > hi Arun,
> > > > >
> > > > > We can keep the discussion going on here and on GitHub when you have a pull request to discuss. There are a number of different people who can give advice.
> > > > >
> > > > > Thanks
> > > > >
> > > > > On Tue, Feb 4, 2020 at 10:11 PM Lekshmi Narayanan, Arun Balajiee <[email protected]> wrote:
> > > > > >
> > > > > > Actually, I made some changes after the date on the pull request (even this year), which are not reflected in this compare link.
> > > > > >
> > > > > > Regards,
> > > > > > Arun Balajiee
> > > > > >
> > > > > > From: Wes McKinney <[email protected]>
> > > > > > Sent: Tuesday, February 4, 2020 6:43 PM
> > > > > > To: Parquet Dev <[email protected]>
> > > > > > Cc: Deepak Majeti <[email protected]>; Anatoli Shein <[email protected]>
> > > > > > Subject: Re: Arrow 1404: Adding index for Page-level Skipping
> > > > > >
> > > > > > Here's a compare link in case others want to have a look
> > > > > >
> > > > > > https://github.com/apache/arrow/compare/master...a2un:PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp
> > > > > >
> > > > > > On Tue, Feb 4, 2020 at 5:41 PM Wes McKinney <[email protected]> wrote:
> > > > > > >
> > > > > > > hi Arun,
> > > > > > >
> > > > > > > I took a brief look at your branch. One thing that is missing is the proposed public APIs that use the index pages -- that would be very helpful for this discussion.
> > > > > > >
> > > > > > > I don't think we have any code for doing random access of a particular data page in a column chunk, so having that as an initial matter would also be helpful.
> > > > > > >
> > > > > > > - Wes
> > > > > > >
> > > > > > > On Tue, Feb 4, 2020 at 2:28 PM Lekshmi Narayanan, Arun Balajiee <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi Parquet dev
> > > > > > > >
> > > > > > > > Deepak Majeti was my dev lead during my summer internship, and since then I have been trying to add a few changes to the Arrow Parquet project for the ticket below
> > > > > > > >
> > > > > > > > https://issues.apache.org/jira/browse/PARQUET-1404 (Assigned to Deepak)
> > > > > > > >
> > > > > > > > In this regard, I am making a few changes to src/parquet/file_reader.cc (in a fork on my repository)
> > > > > > > >
> > > > > > > > https://github.com/a2un/arrow/tree/PARQUET-1404-Add-index-pages-to-the-format-to-support-efficient-page-skipping-to-parquet-cpp/cpp
> > > > > > > >
> > > > > > > > I am stuck trying to read a particular row using the index that I get in the page_location array struct of the offset index. Could you help me with this? If there have been discussions on the forums about this as well, could you direct me to that link?
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Arun Balajiee
