hadn't seen that or why; will look at it. I'll probably suggest one use of our LogExactlyOnce logger at info.
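Something like this is what I have in mind (sketched from memory, not the actual CodecPool source; the class and method names below are just for illustration, and it assumes org.apache.hadoop.util.LogExactlyOnce from hadoop-common plus slf4j on the classpath):

    // illustrative sketch only: not the real CodecPool code
    import org.apache.hadoop.util.LogExactlyOnce;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class CodecPoolLoggingSketch {
      private static final Logger LOG =
          LoggerFactory.getLogger(CodecPoolLoggingSketch.class);
      // emits its message once per process instead of once per codec instance
      private static final LogExactlyOnce NEW_CODEC_LOG = new LogExactlyOnce(LOG);

      static void reportNewCompressor(String codecName) {
        NEW_CODEC_LOG.info("Got brand-new compressor [{}]", codecName);
      }
    }

that would keep the first occurrence visible at info while dropping the thousands of repeats.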
there's another option here, for testing, which is: downgrade the logging on that class? it'd work across so many more releases (rough sketch at the bottom of this mail)

On Sat, 13 Jan 2024 at 22:28, Atour Mousavi Gourabi <at...@live.com> wrote:

> Hi Claire, all,
>
> Thanks for trying to pick this up over at Hadoop, it seems like a reasonable change so I hope it gains some traction. In the meantime, I propose we limit the scope of logging in the test suite. Info level logs aren't awfully interesting in this case. IMO bumping it to warn or error by default should suffice for our intents and purposes, greatly reducing the overhead. As for the temp files, I'll look into setting up some teardown routines somewhere next week.
>
> All the best,
> Atour
> ________________________________
> From: Claire McGinty <claire.d.mcgi...@gmail.com>
> Sent: Friday, January 12, 2024 7:36 PM
> To: dev@parquet.apache.org <dev@parquet.apache.org>
> Cc: d...@parquet.incubator.apache.org <d...@parquet.incubator.apache.org>
> Subject: Re: Guidelines for working on parquet-mr?
>
> Related to the noisy console logs, a few months ago I opened a ticket <https://issues.apache.org/jira/browse/HADOOP-18717> for Hadoop to move those CodecPool log statements from INFO to DEBUG, as they're a significant contributor to log size and (IMO) don't add a ton of value for the end user. It hasn't gotten traction so far, but I can try to move forward with opening a pull request for it.
>
> Best,
> Claire
>
> On Fri, Jan 12, 2024 at 7:36 AM Atour Mousavi Gourabi <at...@live.com> wrote:
>
> > Hi Antoine, Gang,
> >
> > I fully agree with both of you that these shortcomings make development on parquet-mr somewhat awkward. As for the duration the full test suite runs for, we won't really be able to decrease that. Instead, if you are only changing one or two modules it might suffice to just run tests for the module(s) you modified. Outside of just running the relevant test fixtures in IntelliJ or any other IDE, this can also be done through Maven using the following command: `mvn -pl :parquet-hadoop -am install -DskipTests && mvn -pl :parquet-hadoop test` for the parquet-hadoop module for example. If you want to run this command for multiple modules, run it with a comma delimited list of modules after the `-pl` option. So `:parquet-hadoop,:parquet-thrift` instead of `:parquet-hadoop` for both `parquet-hadoop` and `parquet-thrift`. If your changes for whatever unforeseen reason end up breaking stuff in other modules, the CI/CD in remote will catch it before the PR gets merged anyways.
> > As for the issues around temp files and the console logs, I do think it might be worthwhile to look into fixing them. I myself have had some problems with disk partition sizes because of the huge amount of data the Maven lifecycle dumps in temp in the past, and the amount of logging is just unnecessary overhead.
> >
> > All the best,
> > Atour
> > ________________________________
> > From: Gang Wu <ust...@gmail.com>
> > Sent: Friday, January 12, 2024 3:06 AM
> > To: dev@parquet.apache.org <dev@parquet.apache.org>
> > Cc: d...@parquet.incubator.apache.org <d...@parquet.incubator.apache.org>
> > Subject: Re: Guidelines for working on parquet-mr?
> >
> > Hi Antoine,
> >
> > I agree that I have suffered the same thing while developing on parquet-mr. Usually I don't make the full build and test unless for the release process.
> > It would be much easier to use IntelliJ IDEA and run selected tests.
> >
> > Best,
> > Gang
> >
> > On Fri, Jan 12, 2024 at 1:56 AM Antoine Pitrou <anto...@python.org> wrote:
> >
> > > Update: I finally Ctrl-C'ed the tests; they had left around 14 GB of data in /tmp.
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > > On Thu, 11 Jan 2024 18:48:20 +0100 Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > I'm trying to build parquet-mr and I'm unsure how to make the experience smooth enough for development. This is what I observe:
> > > >
> > > > 1) running the tests is extremely long (they have been running for 10 minutes already, with no sign of nearing completion)
> > > >
> > > > 2) the output logs are a true firehose; there's a ton of extremely detailed (and probably superfluous) information being output, such as:
> > > >
> > > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > > > 2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 100000 records.
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
> > > > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
> > > > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> > > >
> > > > [etc.]
> > > >
> > > > 3) it seems the tests are leaving a lot of generated data files behind in /tmp/test..., though of course they might ultimately clean up at the end?
> > > >
> > > > How do people typically develop on parquet-mr?
> > > > Do they have dedicated shell scripts that only build and test parts of the project? Do they use an IDE and select specific options there?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
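
Coming back to the "downgrade the logging on that class" option above: a rough sketch of how the test suite could do it, assuming JUnit 4 and a log4j 1.x binding end up on the test classpath (adjust if the build actually uses log4j2 or logback); the QuietLoggingTestBase name is mine, not an existing class:

    // sketch: quiet the noisiest loggers for the duration of the test JVM;
    // logger names are taken from the output quoted above
    import org.apache.log4j.Level;
    import org.apache.log4j.Logger;
    import org.junit.BeforeClass;

    public abstract class QuietLoggingTestBase {
      @BeforeClass
      public static void quietNoisyLoggers() {
        Logger.getLogger("org.apache.hadoop.io.compress.CodecPool").setLevel(Level.WARN);
        Logger.getLogger("org.apache.parquet.hadoop.InternalParquetRecordReader").setLevel(Level.WARN);
      }
    }

the same effect could probably be had with a couple of lines in each module's test log4j configuration, which would be even less invasive.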