[ https://issues.apache.org/jira/browse/SQOOP-1390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14104497#comment-14104497 ]

Pratik Khadloya commented on SQOOP-1390:
----------------------------------------

When querying the table I created on the Parquet directory, I am getting a 
Snappy-related error on my Mac. I think it is related to a bug in the 
snappy-java library that I haven't yet figured out how to work around.

{code}
Caused by: java.lang.UnsatisfiedLinkError: no snappyjava in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1886)
        at java.lang.Runtime.loadLibrary0(Runtime.java:849)
        at java.lang.System.loadLibrary(System.java:1088)
        at org.xerial.snappy.SnappyNativeLoader.loadLibrary(SnappyNativeLoader.java:52)
        ... 43 more
Exception in thread "main" org.xerial.snappy.SnappyError: [FAILED_TO_LOAD_NATIVE_LIBRARY] null
{code}
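
One avenue that might be worth trying: snappy-java exposes system properties 
that override its native-library lookup, so pointing them at a locally 
installed library could sidestep the bundled-library extraction. This is only 
a sketch; the property values and the /usr/local/lib path are assumptions I 
have not verified on my machine.

{code}
# Hedged workaround sketch: snappy-java consults these properties before
# extracting its bundled native library. The install path and library
# name below are illustrative assumptions for OS X.
export HADOOP_CLIENT_OPTS="$HADOOP_CLIENT_OPTS \
  -Dorg.xerial.snappy.lib.path=/usr/local/lib \
  -Dorg.xerial.snappy.lib.name=libsnappyjava.jnilib"
{code}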

Also, do you know if it is optional to mention the field names when creating 
the Hive table? Would Hive figure out all of the table's columns from the 
schema that is embedded in the Parquet file?

If we create a Hive table on an Avro file, we do not need to mention the 
columns; Hive figures them out automatically from the schema file.
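
For reference, this is roughly how the Avro case works today; the table name 
and paths are placeholders, so treat this as an illustrative sketch rather 
than exact DDL:

{code}
# Hive derives the columns from the Avro schema, so no column list is
# needed. All names and paths below are illustrative placeholders.
hive -e "
CREATE EXTERNAL TABLE example_avro
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/path/to/avro/files'
TBLPROPERTIES ('avro.schema.url'='hdfs:///path/to/schema.avsc');
"
{code}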

> Import data to HDFS as a set of Parquet files
> ---------------------------------------------
>
>                 Key: SQOOP-1390
>                 URL: https://issues.apache.org/jira/browse/SQOOP-1390
>             Project: Sqoop
>          Issue Type: Sub-task
>          Components: tools
>            Reporter: Qian Xu
>            Assignee: Qian Xu
>             Fix For: 1.4.6
>
>         Attachments: SQOOP-1390.patch
>
>
> Because Parquet files keep data in contiguous chunks by column, appending 
> new records to a dataset requires rewriting substantial portions of an 
> existing file or buffering records to create a new file. 
> This JIRA proposes to add the possibility to import an individual table 
> from an RDBMS into HDFS as a set of Parquet files. We will also provide a 
> command-line interface with a new argument {{--as-parquetfile}}.
> Example invocation: 
> {{sqoop import --connect JDBC_URI --table TABLE --as-parquetfile --target-dir /path/to/files}}
> The major items are listed as follows:
> * Implement ParquetImportMapper
> * Hook up the ParquetOutputFormat and ParquetImportMapper in the import job.
> * Be able to support import from scratch or in append mode (see the usage sketch after this note)
> Note that as Parquet is a columnar storage format, it doesn't make sense to 
> write to it directly from record-based tools. So we'd consider using the 
> Kite SDK to simplify the handling of Parquet-specific details.
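>
> A hedged usage sketch for the append case: {{--append}} already exists in 
> Sqoop, but combining it with the proposed {{--as-parquetfile}} is an 
> assumption about how the final interface may look, and the connection 
> string and paths are placeholders.
> {code}
> # Illustrative only: the flag combination and all names are assumptions.
> sqoop import --connect jdbc:mysql://db.example.com/mydb --table orders \
>   --as-parquetfile --target-dir /data/orders --append
> {code}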



--
This message was sent by Atlassian JIRA
(v6.2#6252)
