Hello Sky,

First, I'm sorry I missed your note the week it came in. Since your questions can be read from several different perspectives, I'll just share a few general ideas and suggestions.
There are a few ways to connect Impala with large volumes of data. Several of them trade preparation time and effort up front for better load performance, at the cost of reduced error checking. A series of INSERT statements is inefficient, as you point out, because it does not amortize the per-query overhead over the volume of data, and it checks every value of every incoming row.

It's not clear which imperfections of Sqoop you are referring to, but Impala does support loading data into HDFS with Sqoop and then defining a schema on top of it after the fact. If you know your complete schema and are confident it fits the data you loaded, you can use CREATE TABLE ... LOCATION ... to point the new table definition at the newly loaded files. If you load partitioned data, you can follow these commands with ALTER TABLE ... RECOVER PARTITIONS, and Impala will find new rows loaded into the partition directories and bind them to the table.

Impala also has a limited ability to discover a schema for loaded data, if the destination format contains enough metadata. For example, you could load data into HDFS in Parquet format, then issue CREATE TABLE ... LIKE PARQUET ..., referencing the new files, and Impala will build the table's metadata from them. Column types are limited to those representable in Parquet, and Parquet is the only format for which Impala implements this feature.

Finally, the LOAD DATA statement lets you populate already-created Impala tables with data from a file *already stored in HDFS*. LOAD DATA does not populate tables from arbitrary files in the OS filesystem namespace.

Hope this helps!
TW

---------- Forwarded message ----------
> From: sky <[email protected]>
> Date: Wed, Sep 13, 2017 at 7:08 PM
> Subject: Data Transfer Between Different Databases
> To: "[email protected]" <[email protected]>
>
> Hi all,
> How does impala interact data with other relational databases?
> Sqoop's functionality is not perfect, and in impala, each insert has 100ms
> query plan overhead. Are there any other easy ways to interact?
>
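
P.S. In case a concrete sketch helps, here is roughly what the statements I mentioned above look like. All table names, column definitions, and HDFS paths below are made up for illustration; substitute your own.

```sql
-- 1. Define a schema on top of files already loaded into HDFS
--    (e.g. by Sqoop), by pointing the table at their directory:
CREATE EXTERNAL TABLE sales (
  id BIGINT,
  amount DECIMAL(10,2),
  sold_at TIMESTAMP
)
STORED AS PARQUET
LOCATION '/user/etl/sales';

-- 2. After loading files directly into partition directories of a
--    partitioned table, have Impala discover the new partitions:
ALTER TABLE sales_by_day RECOVER PARTITIONS;

-- 3. Let Impala infer the column definitions from the metadata of an
--    existing Parquet data file:
CREATE TABLE sales_inferred
LIKE PARQUET '/user/etl/sales/part-00000.parquet'
STORED AS PARQUET;

-- 4. Move a file that is already in HDFS into an existing table's
--    data directory (this does not read files from the local OS
--    filesystem):
LOAD DATA INPATH '/user/etl/staging/sales_batch1.parquet'
INTO TABLE sales;
```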
