Hi Antoine,

I agree; I have run into the same problems while developing on parquet-mr.
I usually don't run the full build and test suite except as part of the
release process. It is much easier to use IntelliJ IDEA and run only
selected tests.
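
As for the leftover data in /tmp, here is a minimal sketch of a cleanup
helper. It assumes the leftovers are directories named /tmp/test... as in
your log; I have not checked that every test follows this pattern, and the
prefix may match unrelated directories. The function names are made up for
illustration, and it only performs a dry run unless told otherwise.

```python
import shutil
import tempfile
from pathlib import Path


def find_leftovers(root=None, prefix="test"):
    """Return sorted (path, size_in_bytes) pairs for directories under root
    whose name starts with prefix (assumed naming, taken from the log)."""
    root = Path(root or tempfile.gettempdir())
    leftovers = []
    for entry in root.iterdir():
        if entry.is_dir() and not entry.is_symlink() and entry.name.startswith(prefix):
            # Sum the sizes of all regular files below this directory.
            size = sum(f.stat().st_size for f in entry.rglob("*") if f.is_file())
            leftovers.append((entry, size))
    return sorted(leftovers)


def clean_leftovers(root=None, prefix="test", dry_run=True):
    """Print each matching directory and delete it unless dry_run is True.
    Returns the total size in bytes of the matching directories."""
    total = 0
    for path, size in find_leftovers(root, prefix):
        action = "would remove" if dry_run else "removing"
        print(f"{action} {path} ({size} bytes)")
        total += size
        if not dry_run:
            shutil.rmtree(path)
    return total


if __name__ == "__main__":
    # Dry run by default: only reports what would be removed.
    clean_leftovers()
```

Review the dry-run output first, then call clean_leftovers(dry_run=False)
once the list looks right.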

Best,
Gang

On Fri, Jan 12, 2024 at 1:56 AM Antoine Pitrou <anto...@python.org> wrote:

>
> Update: I finally Ctrl-C'ed the tests; they had left around 14 GB of
> data in /tmp.
>
> Regards
>
> Antoine.
>
>
> On Thu, 11 Jan 2024 18:48:20 +0100
> Antoine Pitrou <anto...@python.org> wrote:
>
> > Hello,
> >
> > I'm trying to build parquet-mr and I'm unsure how to make the
> > experience smooth enough for development. This is what I observe:
> >
> > 1) running the tests takes extremely long (they have been running for
> > 10 minutes already, with no sign of nearing completion)
> >
> > 2) the output logs are a true firehose; there's a ton of extremely
> > detailed (and probably superfluous) information being output, such as:
> >
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
> > 2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 100000 records.
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
> > 2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
> > 2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
> >
> > [etc.]
> >
> >
> > 3) it seems the tests are leaving a lot of generated data files behind
> > in /tmp/test..., though of course they might ultimately clean up at the
> > end?
> >
> >
> > How do people typically develop on parquet-mr? Do they have dedicated
> > shell scripts that only build and test parts of the project? Do they
> > use an IDE and select specific options there?
> >
> > Regards
> >
> > Antoine.
> >
