[GitHub] [spark] tooptoop4 commented on pull request #33332: [SPARK-36147][SQL] Warn if less files visible after stats write in BasicWriteStatsTracker

GitBox Thu, 15 Jul 2021 05:47:20 -0700


tooptoop4 commented on pull request #33332:
URL: https://github.com/apache/spark/pull/33332#issuecomment-880665317



   The output path is an s3a URL. The bucket is in one of the Asia Pacific 
regions, and it is accessed via an HTTP proxy and with STS.
   
   Some snippets of the log:
   ```
   21-07-12 11:58:36 INFO org.apache.spark.sql.execution.datasources.FileScanRDD: Reading File path: s3a://xxxx/yyyy.csv, range: 0-1951, partition values: [empty row]
   21-07-12 11:58:37 INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 9.010979 ms
   21-07-12 11:58:37 INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 8.593598 ms
   21-07-12 11:58:37 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 2
   21-07-12 11:58:37 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
   21-07-12 11:58:37 INFO org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
   21-07-12 11:58:37 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: File Output Committer Algorithm version is 2
   21-07-12 11:58:37 INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
   21-07-12 11:58:37 INFO org.apache.spark.sql.execution.datasources.SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
   21-07-12 11:58:37 INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 6.470666 ms
   21-07-12 11:58:37 INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 4.507112 ms
   21-07-12 11:58:37 INFO org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator: Code generated in 13.859354 ms
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.codec.CodecConfig: Compression: SNAPPY
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Parquet block size to 134217728
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Parquet page size to 1048576
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Parquet dictionary page size to 1048576
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Dictionary is on
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Validation is off
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Writer version is: PARQUET_1_0
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Maximum row group padding size is 8388608 bytes
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Page size checking is: estimated
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Min row count for page size check is: 100
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Max row count for page size check is: 10000
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Truncate length for column indexes is: 64
   21-07-12 11:58:37 INFO org.apache.parquet.hadoop.ParquetOutputFormat: Page row count limit to 20000
   21-07-12 11:58:37 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new compressor [.snappy]
   21-07-12 11:58:38 INFO org.apache.spark.sql.execution.datasources.BasicWriteTaskStatsTracker: Expected 1 files, but only saw 0. This could be due to the output format not writing empty files, or files being not immediately visible in the filesystem.
   21-07-12 11:58:38 INFO org.apache.spark.mapred.SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_xxxx836_0021_m_000000_617
   ```
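
For context on the "Expected 1 files, but only saw 0" line, here is a minimal sketch of the kind of check that produces it: count the files a task submitted, probe each one, and report a mismatch when some are not visible. This is not Spark's code. It uses `java.nio.file` on the local filesystem instead of the Hadoop `FileSystem` API, and the names `WriteStatsSketch`, `countVisible` and `report` are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.NoSuchFileException;
import java.nio.file.Path;
import java.util.List;

public class WriteStatsSketch {
    /** Counts how many of the submitted output files are visible right now. */
    static int countVisible(List<Path> submitted) throws IOException {
        int seen = 0;
        for (Path p : submitted) {
            try {
                Files.size(p); // probe the file; throws if it cannot be found
                seen++;
            } catch (NoSuchFileException e) {
                // Not visible (never written, or not yet listed): skip it
                // rather than failing the task.
            }
        }
        return seen;
    }

    /** Builds a mismatch message, or an empty string when the counts agree. */
    static String report(int expected, int seen) {
        if (expected == seen) {
            return "";
        }
        return "Expected " + expected + " files, but only saw " + seen + ".";
    }

    public static void main(String[] args) throws IOException {
        Path written = Files.createTempFile("part-", ".parquet");
        // Sibling path that is never created, standing in for an invisible file.
        Path missing = written.resolveSibling(written.getFileName() + ".missing");
        List<Path> submitted = List.of(written, missing);
        System.out.println(report(submitted.size(), countVisible(submitted)));
        Files.delete(written);
    }
}
```

The sketch cannot tell "never written" apart from "written but not yet visible", which is the same ambiguity the log message itself describes.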
   
   I can't find a fat JAR of cloudstore and don't have access to a Hadoop 
installation at the moment:
   ```
   java -jar cloudstore-1.0.jar storediag
   no main manifest attribute, in cloudstore-1.0.jar
   ```
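
The "no main manifest attribute" message is what `java -jar` prints when the JAR's manifest has no `Main-Class` entry, so the JAR cannot be launched directly that way. A small sketch of how to check this for any JAR with the standard `java.util.jar` API follows. The class and method names (`MainClassCheck`, `mainClassOf`, `writeJar`) and the `example.Main` value are hypothetical, and the demo builds its own throwaway JARs rather than reading cloudstore's.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.jar.Attributes;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

public class MainClassCheck {
    /** Returns the manifest's Main-Class value, or null when it is absent. */
    static String mainClassOf(Path jar) throws IOException {
        try (JarFile jf = new JarFile(jar.toFile())) {
            Manifest mf = jf.getManifest(); // null when the JAR has no manifest
            if (mf == null) {
                return null;
            }
            return mf.getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        }
    }

    /** Writes a manifest-only JAR for the demo; mainClass may be null. */
    static Path writeJar(String mainClass) throws IOException {
        Manifest mf = new Manifest();
        // Manifest-Version must be set, otherwise no attributes are written.
        mf.getMainAttributes().put(Attributes.Name.MANIFEST_VERSION, "1.0");
        if (mainClass != null) {
            mf.getMainAttributes().put(Attributes.Name.MAIN_CLASS, mainClass);
        }
        Path jar = Files.createTempFile("demo-", ".jar");
        try (OutputStream out = Files.newOutputStream(jar);
             JarOutputStream jos = new JarOutputStream(out, mf)) {
            // No further entries: the manifest is the only content.
        }
        return jar;
    }

    public static void main(String[] args) throws IOException {
        Path plain = writeJar(null);
        Path runnable = writeJar("example.Main");
        System.out.println("plain jar Main-Class: " + mainClassOf(plain));
        System.out.println("runnable jar Main-Class: " + mainClassOf(runnable));
        Files.delete(plain);
        Files.delete(runnable);
    }
}
```

A `null` result corresponds to the case where `java -jar` refuses to start the JAR.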


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


