Github user chutium commented on a diff in the pull request:
https://github.com/apache/spark/pull/1959#discussion_r16530668
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala ---
@@ -373,9 +373,11 @@ private[parquet] object ParquetTypesConverter extends Logging {
     }
     ParquetRelation.enableLogForwarding()
+    // NOTE: Explicitly list "_temporary" because hadoop 0.23 removed the variable TEMP_DIR_NAME
+    // from FileOutputCommitter. Check MAPREDUCE-5229 for the detail.
     val children = fs.listStatus(path).filterNot { status =>
       val name = status.getPath.getName
-      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME
+      name(0) == '.' || name == FileOutputCommitter.SUCCEEDED_FILE_NAME || name == "_temporary"
     }
     // NOTE (lian): Parquet "_metadata" file can be very slow if the file consists of lots of row
--- End diff --
hmm, a better solution for all of this could be: no more ```val children = fs.listStatus(path)...``` at all, then:
```
val metafile = fs.listStatus(path).find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE)
val datafile = fs.listStatus(path).find(s => isNotHiddenFile(s.getPath.getName))
```
this ```isNotHiddenFile``` would simply check something like ```(name(0) != '.' && name(0) != '_')```
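just as a quick sketch of that helper (```isNotHiddenFile``` is a name I'm making up here, not an existing Spark utility, and I'd also guard against empty names):
```
def isNotHiddenFile(name: String): Boolean =
  name.nonEmpty && name(0) != '.' && name(0) != '_'
```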
then something like:
```
datafile.orElse(metafile) match {
  case Some(status) => ParquetFileReader.readFooter(conf, status.getPath)
  case None => sys.error(s"no Parquet data file or _metadata file found under $path")
}
```
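putting the pieces together, the whole footer-reading part of ```readMetaData``` could become roughly this (just a sketch; the method name ```readFooterOf``` and the error handling are only my illustration, not actual Spark code):
```
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import parquet.hadoop.{ParquetFileReader, ParquetFileWriter}
import parquet.hadoop.metadata.ParquetMetadata

// helper from above, repeated here to keep the sketch self-contained
def isNotHiddenFile(name: String): Boolean =
  name.nonEmpty && name(0) != '.' && name(0) != '_'

def readFooterOf(path: Path, conf: Configuration): ParquetMetadata = {
  val fs = path.getFileSystem(conf)
  // one listStatus call instead of two
  val statuses = fs.listStatus(path)
  val metafile = statuses.find(_.getPath.getName == ParquetFileWriter.PARQUET_METADATA_FILE)
  val datafile = statuses.find(s => isNotHiddenFile(s.getPath.getName))
  // prefer a visible data file, fall back to the "_metadata" summary file
  datafile.orElse(metafile) match {
    case Some(status) => ParquetFileReader.readFooter(conf, status.getPath)
    case None => sys.error(s"no Parquet data file or _metadata file found under $path")
  }
}
```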
and moreover, @liancheng, after reading the following comments carefully, I finally understand what you mean by "complete Parquet file on HDFS should be directory": https://github.com/apache/spark/pull/2044#issuecomment-52733594
you mean the whole directory is "a single parquet file", and the files in it are "data"? but such a definition is really very confusing... are you sure about this definition? I just googled and found nothing, only statements like "Parquet files are self-describing so the schema is preserved"
so, since they are self-describing, in my mind each "data file" in a parquet file folder is also a valid parquet-format file, and it should also work as an input source for a parquet reader like our Spark SQLContext... for example:
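just to illustrate, in spark-shell it could look like this (a minimal sketch; the HDFS path and part file name are made up):
```
// point SQLContext directly at ONE data file inside a Parquet "directory file"
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val singlePart = sqlContext.parquetFile("hdfs:///tmp/events.parquet/part-r-00001.parquet")
singlePart.printSchema()  // the schema comes from the part file's own footer
```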