Hello,

I'm trying to build parquet-mr and I'm unsure how to make the
experience smooth enough for development. This is what I observe:

1) running the tests takes an extremely long time (they have been
running for 10 minutes already, with no sign of nearing completion)

2) the output logs are a true firehose; there's a ton of extremely
detailed (and probably superfluous) information being output, such as:

2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.gz]
2024-01-11 18:45:33 INFO CodecPool - Got brand-new compressor [.zstd]
2024-01-11 18:45:33 INFO ParquetRewriter - Finish rewriting input file: file:/tmp/test12306662267168473656/test.parquet
2024-01-11 18:45:33 INFO InternalParquetRecordReader - RecordReader initialized will read a total of 100000 records.
2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 0. reading next block
2024-01-11 18:45:33 INFO CodecPool - Got brand-new decompressor [.zstd]
2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 1 ms. row count = 100
2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 100 records from 6 columns in 0 ms: Infinity rec/ms, Infinity cell/ms
2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 100% reading (1 ms) and 0% processing (0 ms)
2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 100. reading next block
2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 200 records from 6 columns in 1 ms: 200.0 rec/ms, 1200.0 cell/ms
2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)
2024-01-11 18:45:33 INFO InternalParquetRecordReader - at row 200. reading next block
2024-01-11 18:45:33 INFO InternalParquetRecordReader - block read in memory in 0 ms. row count = 100
2024-01-11 18:45:33 INFO InternalParquetRecordReader - Assembled and processed 300 records from 6 columns in 1 ms: 300.0 rec/ms, 1800.0 cell/ms
2024-01-11 18:45:33 INFO InternalParquetRecordReader - time spent so far 50% reading (1 ms) and 50% processing (1 ms)

[etc.]


3) it seems the tests are leaving a lot of generated data files behind
in /tmp/test..., though of course they might ultimately clean up at the
end?


How do people typically develop on parquet-mr? Do they have dedicated
shell scripts that only build and test parts of the project? Do they
use an IDE and select specific options there?

Regards

Antoine.

