[
https://issues.apache.org/jira/browse/PARQUET-1953?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sam Halliday updated PARQUET-1953:
----------------------------------
Description:
parquet-hadoop provides the only mechanism for loading .parquet files, and it
declares an optional (provided) dependency on hadoop-common, implying that
parquet-hadoop can be used without Hadoop. In practice, however, hadoop-common
is required at runtime.
The following code is needed to instantiate a ParquetFileReader:
{code}
import java.io.{File, FileInputStream}

import org.apache.parquet.io.{DelegatingSeekableInputStream, InputFile, SeekableInputStream}

final class LocalInputFile(file: File) extends InputFile {
  def getLength(): Long = file.length()
  def newStream(): SeekableInputStream = {
    val input = new FileInputStream(file)
    new DelegatingSeekableInputStream(input) {
      def getPos(): Long = input.getChannel().position()
      def seek(bs: Long): Unit = {
        val _ = input.getChannel().position(bs)
      }
    }
  }
}
{code}
but using it without hadoop-common on the classpath leads to a runtime
exception: parquet-hadoop references org.apache.hadoop.fs.PathFilter, which in
turn depends on org.apache.hadoop.fs.Path, and both classes live in
hadoop-common.
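Note that the I/O mechanism the wrapper above relies on is pure JDK and works with no Hadoop classes at all: repositioning a FileInputStream's channel moves the stream's read position. A minimal stdlib-only sketch (class name SeekDemo is my own, not from parquet-hadoop):

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class SeekDemo {
    public static void main(String[] args) throws IOException {
        // Write a temp file with known contents.
        Path p = Files.createTempFile("seekdemo", ".bin");
        Files.write(p, "abcdefghij".getBytes(StandardCharsets.US_ASCII));
        try (FileInputStream in = new FileInputStream(p.toFile())) {
            // "seek": reposition the underlying channel, exactly as the
            // DelegatingSeekableInputStream wrapper does.
            in.getChannel().position(5);
            // The stream now reads from offset 5.
            System.out.println((char) in.read());
            // "getPos": the channel position has advanced past the read byte.
            System.out.println(in.getChannel().position());
        }
        Files.delete(p);
    }
}
```

This is why requiring hadoop-common for local reads is purely an accident of class references, not a functional necessity.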
hadoop-common is an extremely large dependency, and I would rather that
downstream users were not forced to pull it in.
A search for "import org.apache.hadoop" in src/main reveals a few more places
where the dependency is hardwired, although these are often in deprecated
static constructors and therefore benign.
> hadoop-common is not an optional dependency
> -------------------------------------------
>
> Key: PARQUET-1953
> URL: https://issues.apache.org/jira/browse/PARQUET-1953
> Project: Parquet
> Issue Type: Bug
> Reporter: Sam Halliday
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)