Github user chutium commented on a diff in the pull request:
https://github.com/apache/spark/pull/1959#discussion_r16530668
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -373,9 +373,11 @@ private[parquet] object ParquetTypesConverter extends Logging {
     }
     ParquetRelation.enableLogForwarding()
+    // NOTE: Explicitly list "_temporary" because hadoop 0.23 removed the variable TEMP_DIR_NAME
+    // from FileOutputCommitter. Check MAPREDUCE-5229 for the detail.
     val children = fs.listStatus(path).filterNot { status =>
       val name = status.getPath.getName
-      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME
+      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME || name == "_temporary"
     }
     // NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
--- End diff --
hmm, a better solution for all of this could be: no more ```val children = fs.listStatus(path)...``` at all, then:
```
val metafile = fs.listStatus(path).find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE)
val datafile = fs.listStatus(path).find(s => isNotHiddenFile(s.getPath.getName))
```
this ```isNotHiddenFile``` would simply check something like ```(name(0) != '.' && name(0) != '_')```
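just as a quick sketch of that helper (```isNotHiddenFile``` is a name I'm making up here, not an existing Spark utility, and I'd also guard against empty names):
```
def isNotHiddenFile(name: String): Boolean =
  name.nonEmpty && name(0) != '.' && name(0) != '_'
```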
then something like:
```
datafile.orElse(metafile) match {
  case Some(status) => ParquetFileReader.readFooter(conf, status.getPath)
  case None => sys.error(s"no Parquet data file or _metadata file found under $path")
}
```
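putting the pieces together, the whole footer-reading part of ```readMetaData``` could become roughly this (just a sketch; the method name ```readFooterOf``` and the error handling are only my illustration, not actual Spark code):
```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import parquet.hadoop.metadata.ParquetMetadata

// helper from above, repeated here to keep the sketch self-contained
def isNotHiddenFile(name: String): Boolean =
  name.nonEmpty && name(0) != '.' && name(0) != '_'

def readFooterOf(path: Path, conf: Configuration): ParquetMetadata = {
  val fs = path.getFileSystem(conf)
  // one listStatus call instead of two
  val statuses = fs.listStatus(path)
  val metafile = statuses.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE)
  val datafile = statuses.find(s => isNotHiddenFile(s.getPath.getName))
  // prefer a visible data file, fall back to the "_metadata" summary file
  datafile.orElse(metafile) match {
    case Some(status) => ParquetFileReader.readFooter(conf, status.getPath)
    case None => sys.error(s"no Parquet data file or _metadata file found under $path")
  }
}
```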
and moreover, @liancheng, after reading the following comments carefully, I finally understand what you mean by "complete Parquet file on HDFS should be directory": https://github.com/apache/spark/pull/2044#issuecomment-52733594
you mean the whole directory is "a single parquet file", and the files in it are "data"? but such a definition is really very confusing... are you sure about this definition? I just googled and found nothing, only statements like "Parquet files are self-describing so the schema is preserved"
so, since they are self-describing, in my mind each "data file" in a parquet file folder is also a valid parquet-format file, and it should also work as an input source for a parquet reader like our Spark SQLContext... for example:
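just to illustrate, in spark-shell it could look like this (a minimal sketch; the HDFS path and part file name are made up):
```
// point SQLContext directly at ONE data file inside a Parquet "directory file"
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val singlePart = sqlContext.parquetFile("hdfs:///tmp/events.parquet/part-r-00001.parquet")
singlePart.printSchema()  // the schema comes from the part file's own footer
```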