Qian,

(1) -> The target table (Hudi table) will be automatically created by the
HDFSImporter tool. You don't need to manually create this.
(2) -> Hudi ingests data based on the Avro schema provided by clients.
Since the importer tool goes through the same code paths, we require an
Avro schema to be passed to it, and it should be the latest schema.
Ideally, we could derive the schema from the parquet files themselves,
but since schemas evolve over time, different parquet files may carry
different schemas.
(3) -> You can put the Avro schema in the schema file. An Avro schema is
simply JSON, so as long as you save that JSON into the schema file, it
should work.

I see that the SESSION_DATE field is present in your schema. What can
happen is that some records might not have this field populated; when
that happens, we cannot assign a partition path for the record, and that
results in the exception you are seeing. Are you sure that all of your
existing records in parquet have this field populated (i.e. NOT null)?

Thanks,
Nishith

On Tue, Oct 8, 2019 at 10:55 AM Qian Wang <qwang1...@gmail.com> wrote:

> Hi Nishith,
>
> Thanks for your response.
> The session_date is one field in my original dataset. I have some
> questions about the schema parameter:
>
> 1. Do I need to create the target table?
> 2. My source data is in Parquet format, so why does the tool need a schema
> file as a parameter?
> 3. Can I use a schema file in Avro format?
>
> The schema looks like:
>
> {"type":"record","name":"PathExtractData","doc":"Path event extract fact
> data”,”fields”:[
>     {“name”:”SESSION_DATE”,”type”:”string”},
>     {“name”:”SITE_ID”,”type”:”int”},
>     {“name”:”GUID”,”type”:”string”},
>     {“name”:”SESSION_KEY”,”type”:”long”},
>     {“name”:”USER_ID”,”type”:”string”},
>     {“name”:”STEP”,”type”:”int”},
>     {“name”:”PAGE_ID”,”type”:”int”}
> ]}
>
> Thanks.
>
> Best,
> Qian
> On Oct 8, 2019, 10:47 AM -0700, nishith agarwal <n3.nas...@gmail.com>,
> wrote:
> > Qian,
> >
> > It looks like the partitionPathField that you specified (session_date) is
> > missing, or the code is unable to grab it from your payload. Is this field
> > a top-level field or a nested field in your schema?
> > (Currently, the HDFSImporterTool looks for your partitionPathField only at
> > the top level, for example genericRecord.get("session_date").)
> >
> > Thanks,
> > Nishith
> >
> >
> > On Tue, Oct 8, 2019 at 10:12 AM Qian Wang <qwang1...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > Thanks for your response.
> > >
> > > Now I am trying to convert an existing dataset to a Hudi-managed dataset
> > > using hdfsparquetimport in hudi-cli. I encountered the following error:
> > >
> > > 19/10/08 09:50:59 INFO DAGScheduler: Job 1 failed: countByKey at
> > > HoodieBloomIndex.java:148, took 2.913761 s
> > > 19/10/08 09:50:59 ERROR HDFSParquetImporter: Error occurred.
> > > org.apache.hudi.exception.HoodieUpsertException: Failed to upsert for
> > > commit time 20191008095056
> > >
> > > Caused by: org.apache.hudi.exception.HoodieIOException: partition key is
> > > missing. :session_date
> > >
> > > My command in hudi-cli is as follows:
> > > hdfsparquetimport --upsert false --srcPath /path/to/source --targetPath
> > > /path/to/target --tableName xxx --tableType COPY_ON_WRITE --rowKeyField
> > > _row_key --partitionPathField session_date --parallelism 1500
> > > --schemaFilePath /path/to/avro/schema --format parquet --sparkMemory 6g
> > > --retry 2
> > >
> > > Could you please tell me how to solve this problem? Thanks.
> > >
> > > Best,
> > > Qian
> > > On Oct 6, 2019, 9:15 AM -0700, Qian Wang <qwang1...@gmail.com>, wrote:
> > > > Hi,
> > > >
> > > > I have some questions as I try to use Hudi in my company's prod env:
> > > >
> > > > 1. When I migrate the history table in HDFS, I tried to use hudi-cli
> > > > and the HDFSParquetImporter tool. How can I specify Spark parameters
> > > > for this tool, such as the Yarn queue, etc.?
> > > > 2. Hudi needs to write metadata to Hive, and it uses
> > > > HiveMetastoreClient and HiveJDBC. What should I do if Hive has
> > > > Kerberos authentication enabled?
> > > >
> > > > Thanks.
> > > >
> > > > Best,
> > > > Qian
> > >
>
