Hey Omar,

I did figure it out at last. It took some time, though, and I am not sure if this is the only method. There may be more efficient ways, but I think we can work with this for now.
Check the code — I am reading it *row-wise* with fast CSV.
*Code file*: https://github.com/heisenbuug/Benchmark-CSV-Parsers/blob/main/benchmark.cpp
See the last part, i.e. loading using fast CSV. Let me know if you see any necessary changes.

Thanks,
Gopi

On Fri, Apr 2, 2021 at 10:55 PM Omar Shrit <o...@shrit.me> wrote:
> Great, cannot wait to see the results.
>
> I did not have a chance to look at the API of fast CSV; there are
> no examples or clear docs, so it requires reading the code directly.
> I will let you know if I have anything.
>
> Best,
>
> Omar
>
> On 04/02, Gopi Manohar Tatiraju wrote:
> > Hey,
> >
> > So I was not able to work out fast CSV, but I edited the existing code to
> > read the whole data column-wise. Each column is returned to us as a
> > std::vector, which I then convert to an arma::vec and, at the end,
> > insert the column into the arma::mat.
> > Suggested code changes:
> >
> > > arma::fmat mat(doc.GetRowCount(), doc.GetColumnCount());
> > > std::vector<float> column;
> > > for (int i = 0; i < doc.GetColumnCount(); i++)
> > > {
> > >   column = doc.GetColumn<float>(i);
> > >   arma::fvec column_vector(column);
> > >   mat.col(i) = column_vector;
> > > }
> >
> > I am running the benchmark code; it's going to take some time, so I will
> > upload the results once the code finishes running.
> > Also, any idea regarding the other parser would help.
> >
> > Thanks,
> > Gopi
> >
> > On Fri, Apr 2, 2021 at 12:47 AM Gopi Manohar Tatiraju <deathcod...@gmail.com> wrote:
> >
> > > Hey,
> > >
> > > Was working on it.
> > > Here's the link:
> > > https://github.com/heisenbuug/Benchmark-CSV-Parsers/blob/main/csvparser_log_check.ipynb
> > >
> > > Thanks,
> > > Gopi
> > >
> > > On Fri, Apr 2, 2021 at 12:28 AM Omar Shrit <o...@shrit.me> wrote:
> > >
> > >> Hello Gopi,
> > >>
> > >> Thank you for starting the benchmark. Would it be possible to plot the
> > >> log and add the results to the open pull request to get a better
> > >> comparison?
> > >>
> > >> The code seems to be fine; it can be optimized, but I would wait to see
> > >> the plots.
> > >>
> > >> Thanks,
> > >>
> > >> Omar
> > >>
> > >> On 04/01, Gopi Manohar Tatiraju wrote:
> > >> > Hey Omar,
> > >> >
> > >> > Sorry, it took longer. I have been running the benchmark code since this
> > >> > morning, and it took a lot of time as my system is a bit slow.
> > >> > I compared the default Armadillo parser, mlpack's custom parser, and
> > >> > rapidcsv.
> > >> >
> > >> > Can you verify the code I used? I might have done something wrong, and it
> > >> > took a lot of time to run this code, but that may be due to the fact that
> > >> > my system is not that powerful.
> > >> > *Link to the repo and log file:*
> > >> > https://github.com/heisenbuug/Benchmark-CSV-Parsers
> > >> >
> > >> > In the meantime, I will also start working on my draft proposal a bit,
> > >> > and once we finish this testing we can use the results to decide our
> > >> > plan of action. Let me know if you have any suggestions or points for
> > >> > the draft proposal.
> > >> >
> > >> > Thank you,
> > >> > Gopi M. Tatiraju
> > >> >
> > >> >
> > >> > On Thu, Apr 1, 2021 at 5:59 PM Omar Shrit <o...@shrit.me> wrote:
> > >> >
> > >> > > Hello Gopi,
> > >> > >
> > >> > > Would it be possible to do some benchmarks for these two and compare
> > >> > > them with the already existing Boost Spirit parser? If there is a
> > >> > > considerable difference in performance between these two parsers,
> > >> > > then the obvious choice will be the faster one. I know that both of
> > >> > > them are called (fast, rapid), but I have not seen any benchmark yet
> > >> > > to know which one is actually faster.
> > >> > >
> > >> > > Let me know what you think; the benchmark will help us make a better
> > >> > > choice, since this is the internal (private) API and will not be used
> > >> > > by the user directly.
> > >> > >
> > >> > > These are my thoughts, let me know what you think.
> > >> > >
> > >> > > Omar.
> > >> > >
> > >> > > On 04/01, Gopi Manohar Tatiraju wrote:
> > >> > > > Hey,
> > >> > > >
> > >> > > > So, I went through both the libraries we considered for `csv parsers`.
> > >> > > > I implemented code to load the data from a small example `csv` file
> > >> > > > into an arma::mat; here is the sample code, let me know what you think.
> > >> > > > Am I loading into the arma::mat wrong? Can there be any other, more
> > >> > > > efficient way?
> > >> > > >
> > >> > > > Fast CSV Parser <https://github.com/ben-strasser/fast-cpp-csv-parser>
> > >> > > > io::CSVReader<4> in("llog.csv");
> > >> > > > float a, b, c, d;
> > >> > > > int row = 0;
> > >> > > > arma::mat data(20, 4);
> > >> > > >
> > >> > > > while (in.read_row(a, b, c, d))
> > >> > > > {
> > >> > > >   data(row, 0) = a;
> > >> > > >   data(row, 1) = b;
> > >> > > >   data(row, 2) = c;
> > >> > > >   data(row, 3) = d;
> > >> > > >   row++;
> > >> > > > }
> > >> > > >
> > >> > > > Rapid.csv <https://github.com/d99kris/rapidcsv>
> > >> > > > // For headerless csv files
> > >> > > > rapidcsv::Document doc("llog.csv", rapidcsv::LabelParams(-1, -1));
> > >> > > > arma::mat data(doc.GetRowCount(), doc.GetColumnCount(), arma::fill::ones);
> > >> > > >
> > >> > > > std::vector<float> col;
> > >> > > > for (int i = 0; i < doc.GetRowCount(); i++)
> > >> > > > {
> > >> > > >   col = doc.GetRow<float>(i);
> > >> > > >   for (int j = 0; j < doc.GetColumnCount(); j++)
> > >> > > >   {
> > >> > > >     data(i, j) = col[j];
> > >> > > >   }
> > >> > > > }
> > >> > > >
> > >> > > > After using both, I feel like `rapid.csv` is easier to grasp and work
> > >> > > > on, and it seemed more structured.
> > >> > > > Let me know your thoughts.
> > >> > > > Also, if loading like the above example is fine, this can be
> > >> > > > converted into a function that can act as basic csv file loading
> > >> > > > into an arma::mat, right?
> > >> > > >
> > >> > > > Thank You,
> > >> > > > Gopi
> > >> > > >
> > >> > > > On Mon, Mar 29, 2021 at 8:28 PM Omar Shrit <o...@shrit.me> wrote:
> > >> > > >
> > >> > > > > Hey Gopi
> > >> > > > >
> > >> > > > > On 03/29, Gopi Manohar Tatiraju wrote:
> > >> > > > > > Hey,
> > >> > > > > >
> > >> > > > > > I agree; after going a bit through both the candidates, I can
> > >> > > > > > see we can offload a lot of work by using a well-implemented
> > >> > > > > > existing parser.
> > >> > > > > > I think I should start by comparing both the mentioned libraries
> > >> > > > > > to decide which one to use. I will use the same benchmark strategy
> > >> > > > > > that was discussed in the issue. Does that sound good?
> > >> > > > >
> > >> > > > > Sounds good to me.
> > >> > > > >
> > >> > > > > > And also I think I can work on replacing Boost Spirit in GSoC
> > >> > > > > > then. This will be a start to the data frame idea. Even if we
> > >> > > > > > are left with time after this, I can start the work on the data
> > >> > > > > > frame as well. Is that reasonable?
> > >> > > > >
> > >> > > > > Yes, of course.
> > >> > > > >
> > >> > > > > > Thanks,
> > >> > > > > > Gopi
> > >> > > > > >
> > >> > > > > >
> > >> > > > > > On Mon, Mar 29, 2021 at 7:33 PM Omar Shrit <o...@shrit.me> wrote:
> > >> > > > > >
> > >> > > > > > > Hey Gopi,
> > >> > > > > > >
> > >> > > > > > > I totally agree with Ryan: using an existing parser will
> > >> > > > > > > accelerate the project and allow us to move forward with the
> > >> > > > > > > dataframe class.
> > >> > > > > > > Also, I do believe that replacing Boost Spirit with an
> > >> > > > > > > existing parser will take a considerable amount of the summer.
> > >> > > > > > >
> > >> > > > > > > Thanks,
> > >> > > > > > >
> > >> > > > > > > Omar
> > >> > > > > > >
> > >> > > > > > > On 03/29, Ryan Curtin wrote:
> > >> > > > > > > > On Mon, Mar 29, 2021 at 04:17:35PM +0530, Gopi Manohar Tatiraju wrote:
> > >> > > > > > > > > Would love to hear your thoughts on whether to go with an
> > >> > > > > > > > > already implemented parser or build a new one. Also, if we
> > >> > > > > > > > > are planning to build a data frame here, then maybe going
> > >> > > > > > > > > with an in-house parser would be better, as we will have
> > >> > > > > > > > > the ability to design it in such a way that it can extend
> > >> > > > > > > > > maximum support to the new data frame which we are
> > >> > > > > > > > > planning to build ahead.
> > >> > > > > > > >
> > >> > > > > > > > Hey Gopi,
> > >> > > > > > > >
> > >> > > > > > > > Honestly, I think it's best to use another package. Not only
> > >> > > > > > > > will this free up time to actually work on the dataframe
> > >> > > > > > > > class, but also it means we are not responsible for
> > >> > > > > > > > maintenance of the CSV parser. There are lots of little
> > >> > > > > > > > complexities and edge cases in parsing (not to mention
> > >> > > > > > > > efficiency!) and so we can probably get a lot more bang for
> > >> > > > > > > > our buck here by using an implementation from someone who
> > >> > > > > > > > has already put down the time to consider all those details.
> > >> > > > > > > >
> > >> > > > > > > > Hope this is helpful.
> > >> > > > > > > > :)
> > >> > > > > > > >
> > >> > > > > > > > Thanks,
> > >> > > > > > > >
> > >> > > > > > > > Ryan
> > >> > > > > > > >
> > >> > > > > > > > --
> > >> > > > > > > > Ryan Curtin | "Kill them, Machine... kill them all."
> > >> > > > > > > > r...@ratml.org | - Dino Velvet
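The column-major conversion being discussed above can be sanity-checked without either library. Below is a minimal, standard-library-only sketch of the same idea — parse a headerless CSV row by row, but store the values column-major, which is how arma::mat lays out its elements internally (so filling column by column, as in the GetColumn approach, matches Armadillo's memory order). `LoadCsvColumnMajor` is a hypothetical helper for illustration, not part of rapidcsv, fast-cpp-csv-parser, or mlpack:

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Parse headerless CSV text into a column-major float buffer.
// Element (i, j) of the logical matrix lives at data[j * nRows + i],
// mirroring arma::mat's internal layout.
std::vector<float> LoadCsvColumnMajor(const std::string& csvText,
                                      size_t& nRows, size_t& nCols)
{
  std::vector<std::vector<float>> rows;
  std::istringstream lines(csvText);
  std::string line;
  while (std::getline(lines, line))
  {
    if (line.empty())
      continue;
    std::vector<float> row;
    std::istringstream fields(line);
    std::string field;
    while (std::getline(fields, field, ','))
      row.push_back(std::stof(field));
    rows.push_back(std::move(row));
  }

  nRows = rows.size();
  nCols = rows.empty() ? 0 : rows[0].size();

  // Transpose row-wise parse results into column-major storage.
  std::vector<float> data(nRows * nCols);
  for (size_t i = 0; i < nRows; ++i)
    for (size_t j = 0; j < nCols; ++j)
      data[j * nRows + i] = rows[i][j];
  return data;
}
```

With the real libraries, the same buffer could be handed to Armadillo via the `arma::mat(ptr, n_rows, n_cols)` constructor without any further element shuffling, which is why the column-wise loop is likely cheaper than assigning element (i, j) one at a time.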
_______________________________________________
mlpack mailing list
mlpack@lists.mlpack.org
http://knife.lugatgt.org/cgi-bin/mailman/listinfo/mlpack