Hello Sky,
First, I'm sorry I missed your note the week it came in.

As I can read your questions from several different perspectives, I'll just
share a few general ideas and suggestions.

There are a few ways to get large volumes of data into Impala.  Several of
them trade preparation time and effort up front for better performance,
with reduced error checking, for example.  A series of INSERT
statements is inefficient, as you point out, because it does not amortize
the per-query overhead over the volume of data, and it checks every value
of every incoming row.

It's not clear which imperfections of Sqoop you refer to; however, Impala
does support loading data into HDFS with Sqoop, then defining a schema on
top of it after the fact.  If you know your complete schema and have high
confidence it fits the data you loaded, you can use CREATE TABLE ...
LOCATION ... to make the new definition point to the newly-loaded files.
If you load partitioned data, you can follow these commands with ALTER
TABLE ... RECOVER PARTITIONS and Impala will find new partition
directories under the table's location and add them to the table.
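
As a rough sketch of that sequence, assuming Sqoop wrote comma-delimited
text files under a hypothetical /user/etl/sales directory, with one
subdirectory per partition named like year=2017 (the table name, columns,
and paths here are made up for illustration):

```sql
-- Hypothetical table defined over files already loaded into HDFS.
-- EXTERNAL leaves the files in place if the table is later dropped.
CREATE EXTERNAL TABLE sales (
  id BIGINT,
  customer STRING,
  amount DOUBLE
)
PARTITIONED BY (year INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/etl/sales';

-- Scan the table's directory and register any partition directories
-- (e.g. /user/etl/sales/year=2017) the table does not know about yet.
ALTER TABLE sales RECOVER PARTITIONS;
```

Nothing in these statements validates the files against the declared
columns.  With text files, values that do not fit the declared type
generally show up as NULL at query time, so it's worth spot-checking a few
rows afterward.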

Impala has a limited ability to discover a schema for loaded data, if the
destination format contains enough metadata.  For example, you could load
data into HDFS in Parquet format, then issue CREATE TABLE ... LIKE PARQUET
..., referencing the new files, and Impala will build that table's metadata
from the files.  Column types would be limited to those representable in
Parquet, and Parquet is the only format for which Impala implements this
feature.
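
For instance, a minimal sketch, assuming the Parquet files landed under a
hypothetical /user/etl/events directory (the table name and file name are
placeholders, not anything from your setup):

```sql
-- Derive the column definitions from one of the loaded Parquet files.
-- LIKE PARQUET only supplies the columns; STORED AS PARQUET and LOCATION
-- are still needed so the table reads the loaded files as Parquet.
CREATE EXTERNAL TABLE events
  LIKE PARQUET '/user/etl/events/part-00000.parquet'
  STORED AS PARQUET
  LOCATION '/user/etl/events';

-- Check what Impala inferred before relying on it.
DESCRIBE events;
```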

Finally, the LOAD DATA command allows you to populate already-created
tables in Impala with data from another file *already stored in HDFS*. LOAD
DATA does not populate tables from arbitrary files in the OS filesystem
namespace.
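
A short sketch, assuming a hypothetical table named orders already exists
and the staged files under /user/etl/staging/orders (also a made-up path)
are in the same file format as that table:

```sql
-- Moves (does not copy) the files from the HDFS staging directory
-- into the data directory of the existing table.
LOAD DATA INPATH '/user/etl/staging/orders' INTO TABLE orders;

-- Quick sanity check that the new rows are visible.
SELECT COUNT(*) FROM orders;
```

Since the files are moved rather than parsed and rewritten, this avoids
the per-row cost of INSERT, but it also means the data is not checked
against the table definition as it is loaded.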

Hope this helps!
TW


---------- Forwarded message ----------
> From: sky <[email protected]>
> Date: Wed, Sep 13, 2017 at 7:08 PM
> Subject: Data Transfer Between Different Databases
> To: "[email protected]" <[email protected]>
>
>
> Hi all,
>     How does impala interact data with other relational databases?
>  Sqoop's functionality is not perfect, and in impala, each insert has 100ms
> query plan overhead. Are there any other easy ways to interact ?
>
>
