[Orthogonal comment] It's so awesome to see us troubleshooting together. Thanks, everyone on this thread!

On Tue, Sep 17, 2019 at 8:04 PM Taher Koitawala <taher...@gmail.com> wrote:
> No, there are no nulls in the data, and I am getting the same error.
>
> On Wed, Sep 18, 2019, 3:33 AM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > Taher - did you find any NULLs in the data? If you are still not able to
> > make progress, let us know.
> >
> > On Sep 17 2019, at 8:30 am, Taher Koitawala <taher...@gmail.com> wrote:
> > > Sure Gary, let me check if I can find any nulls in there.
> > >
> > > On Tue, Sep 17, 2019 at 1:28 AM Gary Li <yanjia.gary...@gmail.com> wrote:
> > > > Hello, I have seen this exception before. In my case, if the precombine
> > > > key of one entry is null, then I get this error. I'd recommend checking
> > > > whether any row has a null in *last_update*.
> > > >
> > > > Best,
> > > > Gary
> > > >
> > > > On Mon, Sep 16, 2019 at 12:32 PM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > > > Taher,
> > > > > Let me spin up a test of a similar scenario and get back to you.
> > > > >
> > > > > On Sep 16 2019, at 2:09 pm, Taher Koitawala <taher...@gmail.com> wrote:
> > > > > > Hi Kabeer, the Hive table has everything as a string. However, when
> > > > > > fetching data, the Spark query is
> > > > > > .sql(String.format("select contact_id,country,cast(last_update as TIMESTAMP) as last_update from %s",hiveTable))
> > > > > >
> > > > > > On Mon, Sep 16, 2019 at 6:18 PM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > > > > > Is last_update a timestamp? Could you share the Hive schema that you
> > > > > > > are using to create the table? You could run show create table
> > > > > > > <table_name> and send us the output.
> > > > > > >
> > > > > > > On Sep 16 2019, at 1:32 pm, Taher Koitawala <taher...@gmail.com> wrote:
> > > > > > > > Hi Kabeer, same issue when last_update is converted to long.
> > > > > > > >
> > > > > > > > HoodieSparkSQLWriter: Registered avro schema : {
> > > > > > > > "type" : "record",
> > > > > > > > "name" : "s3_master_contacts_list_hudi_record",
> > > > > > > > "namespace" : "hoodie.s3_master_contacts_list_hudi",
> > > > > > > > "fields" : [
> > > > > > > > { "name" : "contact_id", "type" : [ "string", "null" ] },
> > > > > > > > { "name" : "country", "type" : [ "string", "null" ] },
> > > > > > > > { "name" : "last_update", "type" : [ "long", "null" ] }
> > > > > > > > ]
> > > > > > > > }
> > > > > > > >
> > > > > > > > On Mon, Sep 16, 2019 at 4:17 PM Kabeer Ahmed <kab...@linuxmail.org> wrote:
> > > > > > > > > Taher,
> > > > > > > > > This "field not found" exception with HUDI is mostly caused by one
> > > > > > > > > of 2 cases:
> > > > > > > > > 1. The data types of the fields do not match the types listed in
> > > > > > > > > the Hive table.
> > > > > > > > > 2. The field really is not present, which doesn't seem to be your
> > > > > > > > > case.
> > > > > > > > >
> > > > > > > > > I looked into the schema in your log, which is below. Most of the
> > > > > > > > > items seem to be strings, but I am not sure what types you have
> > > > > > > > > defined for them in Hive. If you look at the Hive table definition,
> > > > > > > > > you may find the bug soon.
> > > > > > > > >
> > > > > > > > > On another note, if you are still struggling, try starting with a
> > > > > > > > > very small example and building on it. Ready-made code is at:
> > > > > > > > > https://github.com/apache/incubator-hudi/issues/859#issuecomment-527316262
> > > > > > > > > written by Vinoth. Take that small example, build it up, and then
> > > > > > > > > relate it to your own.
> > > > > > > > > Let us know if this still doesn't work for you.
> > > > > > > > > Thanks
> > > > > > > > > Kabeer.
> > > > > > > > >
> > > > > > > > > > 19/09/16 10:09:26 INFO HoodieSparkSQLWriter: Registered avro schema : {
> > > > > > > > > > "type" : "record",
> > > > > > > > > > "name" : "s3_master_contacts_list_hudi_record",
> > > > > > > > > > "namespace" : "hoodie.s3_master_contacts_list_hudi",
> > > > > > > > > > "fields" : [
> > > > > > > > > > { "name" : "contact_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "phone_number", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "encrypted_phone_number", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "phone_number_hash", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "first_name", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_name", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "email_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "encrypted_email_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "email_id_hash", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "email_id_1", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "encrypted_email_id_1", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "email_id_1_hash", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "e_domain", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "account_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "company", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "company_1", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "flc", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "flc_1", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "flc_trim", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "fln", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "title", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "title_hash", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "address", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "zip_code", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "country", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "city", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "website", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "website_1", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "timezone", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "address_2", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "state_province", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "employees", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "employee_range", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "rev_range", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "std_rev_range", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "company_revenue", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "sic_code", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "nic_code", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "primary_industry", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "primary_industry_1", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "standard_primary_industry", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "primary_db_source", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_r8_email_open", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_r8_email_click", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_zd_email_open", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_zd_email_click", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_phone_verified", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_lead_verified", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "email_status", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_email_status_updated_at", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "is_firmographically_validated", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_firmographically_validated_at", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "is_demographically_validated", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "dq_reason", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "dq_subreason", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "dq_date", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_demographically_validated_at", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "public_profile_link", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "employee_profile_link", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "le_company_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "company_external_entity_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "le_contact_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "contact_external_entity_id", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "asset_1", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "asset_2", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "qc_comments", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "remark", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "tagging", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "sub_tagging", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "old_employees", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "old_revenue", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "old_company_revenue", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "old_primary_industry", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "updated_job_title", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "is_suppressed", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "is_archived", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "is_phone_valid", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "creation_date", "type" : [ "string", "null" ] },
> > > > > > > > > > { "name" : "last_update", "type" : [ "string", "null" ] }
> > > > > > > > > > ]
> > > > > > > > > > }
> > > > > > > > >
> > > > > > > > > On Sep 16 2019, at 11:39 am, Taher Koitawala <taher...@gmail.com> wrote:
> > > > > > > > > > Hi All,
> > > > > > > > > > I currently have a Spark-Hudi job[1] running on EMR emr-5.23.0,
> > > > > > > > > > which reads a Hive CSV table and writes the table to a Hudi
> > > > > > > > > > dataset. The Spark job has a last_update column set as the
> > > > > > > > > > precombine key. However, when running the job I get the
> > > > > > > > > > following error.
> > > > > > > > > >
> > > > > > > > > > Exception:
> > > > > > > > > > WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 3, ip-10-10-10-10, executor 1): com.uber.hoodie.exception.HoodieException: last_update(Part -last_update) field not found in record. Acceptable fields were :[contact_id, ..........................., last_update]
> > > > > > > > > >
> > > > > > > > > > What I don't understand is why HUDI is throwing the exception
> > > > > > > > > > even though HUDI found the column in the acceptable fields. I am
> > > > > > > > > > using Hoodie-0.4.5 and found the same issue on hoodie-0.4.6.
> > > > > > > > > >
> > > > > > > > > > For more info, the entire log file has been attached below.
> > > > > > > > > >
> > > > > > > > > > 1: sparkSession.sqlContext()
> > > > > > > > > > .sql(String.format("select * from %s",hiveTable))
> > > > > > > > > > .write()
> > > > > > > > > > .format("com.uber.hoodie")
> > > > > > > > > > .option("path",s3Path)
> > > > > > > > > > .option(DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY(),"contact_id")
> > > > > > > > > > .option(DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY(),"country")
> > > > > > > > > > .option(DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY(),"last_update")
> > > > > > > > > > .option(HoodieWriteConfig.TABLE_NAME,"s3_hudi_hive_table")
> > > > > > > > > > .mode(SaveMode.Overwrite)
> > > > > > > > > > .saveAsTable("s3_hudi_hive_table");
> > > > > > > > > >
> > > > > > > > > > Regards,
> > > > > > > > > > Taher Koitawala
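
Gary's observation earlier in this thread (a null precombine key produced this same exception for him) would explain how the message can list last_update among the acceptable fields and still say it was not found. Below is a minimal plain-Java sketch of that failure mode, assuming a lookup that treats a null value the same way as a missing field. The class PrecombineLookupSketch, its methods, and the sample rows are hypothetical stand-ins made up for illustration; this is not Hudi's code, and since Taher reported no nulls in his data it may not be the cause in his case.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a record-field lookup; this is NOT Hudi's code.
class PrecombineLookupSketch {

    // Assumed behaviour: a field that is present but holds null is reported
    // the same way as a field that is absent.
    static String fieldAsString(Map<String, Object> record, String field) {
        Object value = record.get(field); // null when absent OR when the stored value is null
        if (value == null) {
            throw new IllegalStateException(field + " field not found in record. "
                    + "Acceptable fields were :" + new ArrayList<>(record.keySet()));
        }
        return value.toString();
    }

    // Returns the indices of rows whose given field is null or absent.
    static List<Integer> rowsWithNull(List<Map<String, Object>> rows, String field) {
        List<Integer> bad = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (rows.get(i).get(field) == null) {
                bad.add(i);
            }
        }
        return bad;
    }

    // Builds a sample row; the values are made up for illustration.
    static Map<String, Object> row(String contactId, String country, Long lastUpdate) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("contact_id", contactId);
        r.put("country", country);
        r.put("last_update", lastUpdate);
        return r;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("c1", "IN", 1568628566000L));
        rows.add(row("c2", "US", null)); // precombine value is null

        System.out.println("Rows with null last_update: " + rowsWithNull(rows, "last_update"));
        try {
            fieldAsString(rows.get(1), "last_update");
        } catch (IllegalStateException e) {
            // The message names last_update as acceptable even though the lookup failed.
            System.out.println(e.getMessage());
        }
    }
}
```

In the actual Spark job, the comparable check would be to count the rows where last_update is null on the DataFrame that is handed to the writer, that is, after the cast. In Spark 2.x, casting a string that cannot be parsed to TIMESTAMP yields null rather than an error, so a column with no nulls in Hive can still contain nulls after the cast.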