Re: Spark ACID compatibility

2021-06-22 Thread Steve Loughran
On Mon, 14 Jun 2021 at 19:07, Mich Talebzadeh 
wrote:

>
>
> Now I am trying to read it in Hive
>
> 0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta;
> +----------------+--------------+----------+
> |    col_name    |  data_type   | comment  |
> +----------------+--------------+----------+
> | id             | int          |          |
> | clustered      | int          |          |
> | scattered      | int          |          |
> | randomised     | int          |          |
> | random_string  | varchar(50)  |          |
> | small_vc       | varchar(50)  |          |
> | padding        | varchar(40)  |          |
> +----------------+--------------+----------+
> 7 rows selected (0.169 seconds)
> 0: jdbc:hive2://rhes75:10099/default>
>
> select count(1) from test.randomDataDelta;
> Error: Error while processing statement: FAILED: Execution Error, return
> code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split
> generation failed with exception: java.lang.NoSuchMethodError:
> org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
> (state=08S01,code=1)
>
> I did a Google search and showed the error I raised three years ago
>
>
> https://user.hive.apache.narkive.com/Td3He6Vj/failed-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-mr-mapredtask-orc-split
>
> So it has not been fixed yet!
>

Looking at the commit log for FileStatus shows HADOOP-14683 touching
compareTo, which cross-references
https://issues.apache.org/jira/browse/HIVE-17133 , which fixes a
regression in https://issues.apache.org/jira/browse/HADOOP-12209 which was
committed by, er, one ste...@apache.org, whoever they are (*).

Try to build hive with the patch, drop in the modified JAR and verify it
works, then confirm this on the hive JIRA. That will reassure reviewers
this patch is needed and correct.
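
Before rebuilding, it may help to confirm which hadoop-common JAR the Hive
job actually loads, since a NoSuchMethodError means the FileStatus class
found at runtime lacks the method the caller was compiled against. A rough
Python sketch follows. The helper name and the byte-search heuristic are
illustrative assumptions, not part of any Hadoop or Hive tooling: it only
checks whether the descriptor string appears in FileStatus.class inside a
JAR, so treat the result as a hint and confirm with javap.

```python
import sys
import zipfile

# Path of the class inside a hadoop-common JAR.
FILESTATUS_CLASS = "org/apache/hadoop/fs/FileStatus.class"
# Descriptor from the error message: takes a FileStatus, returns int.
TYPED_DESCRIPTOR = b"(Lorg/apache/hadoop/fs/FileStatus;)I"


def mentions_typed_compare_to(jar):
    """Heuristic check of one JAR (a path or a binary file object).

    Returns None if the JAR does not contain FileStatus.class, otherwise
    True/False depending on whether the descriptor bytes occur in the class
    file. Descriptors are stored as plain strings in the constant pool, so
    a miss strongly suggests the typed method is absent; a hit could in
    principle come from another method with the same descriptor.
    """
    with zipfile.ZipFile(jar) as archive:
        if FILESTATUS_CLASS not in archive.namelist():
            return None
        class_bytes = archive.read(FILESTATUS_CLASS)
    return TYPED_DESCRIPTOR in class_bytes


if __name__ == "__main__":
    # Usage: python check_filestatus.py /path/to/first.jar /path/to/second.jar
    for jar_path in sys.argv[1:]:
        print(jar_path, mentions_typed_compare_to(jar_path))
```

Running it over every JAR on the Hive and MapReduce classpath should show
whether more than one copy of FileStatus is present and which of them lack
the typed method.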

steve


(*) I hadn't seen that regression; we should maybe have fixed it by
reinstating the old compareTo(Object) as an overloaded call, but that may
have been impossible, as Comparator has expectations.
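
For anyone reading the error in the quoted message: the JVM names the
missing method by its descriptor, which is why the message ends in
"(Lorg/apache/hadoop/fs/FileStatus;)I". A small Python sketch that decodes
such descriptors into Java-style signatures may make the difference between
the two compareTo variants easier to see. The function names here are made
up for illustration; only the descriptor grammar comes from the JVM
specification.

```python
# JVM primitive type codes used in method descriptors.
PRIMITIVES = {
    "B": "byte", "C": "char", "D": "double", "F": "float",
    "I": "int", "J": "long", "S": "short", "Z": "boolean", "V": "void",
}


def _read_type(desc, pos):
    """Decode one type starting at desc[pos]; return (name, next_pos)."""
    dims = 0
    while desc[pos] == "[":          # each '[' adds one array dimension
        dims += 1
        pos += 1
    code = desc[pos]
    if code == "L":                  # object type: L<internal/name>;
        end = desc.index(";", pos)
        name = desc[pos + 1:end].replace("/", ".")
        pos = end + 1
    else:                            # single-letter primitive code
        name = PRIMITIVES[code]
        pos += 1
    return name + "[]" * dims, pos


def decode_method(name, descriptor):
    """Turn a method name plus JVM descriptor into a Java-style signature."""
    if not descriptor.startswith("("):
        raise ValueError("not a method descriptor: " + descriptor)
    pos = 1
    params = []
    while descriptor[pos] != ")":
        type_name, pos = _read_type(descriptor, pos)
        params.append(type_name)
    return_type, _ = _read_type(descriptor, pos + 1)
    return "{} {}({})".format(return_type, name, ", ".join(params))


# The method the caller was compiled against (from the error message):
print(decode_method("compareTo", "(Lorg/apache/hadoop/fs/FileStatus;)I"))
# -> int compareTo(org.apache.hadoop.fs.FileStatus)

# The older, untyped variant mentioned in the footnote:
print(decode_method("compareTo", "(Ljava/lang/Object;)I"))
# -> int compareTo(java.lang.Object)
```

The JVM links calls by exact name and descriptor, so a caller compiled
against the first signature will not fall back to the second at runtime.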


Re: Spark ACID compatibility

2021-06-14 Thread Mich Talebzadeh
I think we are hitting an old bug.

I tried it with:

Hadoop 3.1.1
Hive 3.1.1
Spark 3.1.1

I created an ORC transactional table in Hive (via PySpark):

  CREATE TABLE if not exists test.randomDataDelta(
   ID INT
 , CLUSTERED INT
 , SCATTERED INT
 , RANDOMISED INT
 , RANDOM_STRING VARCHAR(50)
 , SMALL_VC VARCHAR(50)
 , PADDING  VARCHAR(40)
)
  STORED AS ORC
  TBLPROPERTIES (
    "transactional" = "true",
    "orc.create.index"="true",
    "orc.bloom.filter.columns"="ID",
    "orc.bloom.filter.fpp"="0.05",
    "orc.compress"="SNAPPY",
    "orc.stripe.size"="16777216",
    "orc.row.index.stride"="1" )


And populate it through Spark with random data.

It works, and I can read it back through Spark:

starting at ID =  218 ,ending on =  236
Schema of delta table
root
 |-- ID: long (nullable = true)
 |-- CLUSTERED: double (nullable = true)
 |-- SCATTERED: double (nullable = true)
 |-- RANDOMISED: double (nullable = true)
 |-- RANDOM_STRING: string (nullable = true)
 |-- SMALL_VC: string (nullable = true)
 |-- PADDING: string (nullable = true)

+-----+-----+
|minID|maxID|
+-----+-----+
|    1|  236|
+-----+-----+

Finished at
14/06/2021 19:02:43.43


Now I am trying to read it in Hive

0: jdbc:hive2://rhes75:10099/default> desc test.randomDataDelta;
+----------------+--------------+----------+
|    col_name    |  data_type   | comment  |
+----------------+--------------+----------+
| id             | int          |          |
| clustered      | int          |          |
| scattered      | int          |          |
| randomised     | int          |          |
| random_string  | varchar(50)  |          |
| small_vc       | varchar(50)  |          |
| padding        | varchar(40)  |          |
+----------------+--------------+----------+
7 rows selected (0.169 seconds)
0: jdbc:hive2://rhes75:10099/default>

select count(1) from test.randomDataDelta;
Error: Error while processing statement: FAILED: Execution Error, return
code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. ORC split
generation failed with exception: java.lang.NoSuchMethodError:
org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
(state=08S01,code=1)

I did a Google search and it showed the error I raised three years ago:

https://user.hive.apache.narkive.com/Td3He6Vj/failed-execution-error-return-code-1-from-org-apache-hadoop-hive-ql-exec-mr-mapredtask-orc-split

So it has not been fixed yet!

HTH




*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.




On Mon, 14 Jun 2021 at 16:29, Suryansh Agnihotri 
wrote:

> No this also does not work.
> Steps I followed.
> spark-sql:
> CREATE TABLE students (id int, name string, marks int) STORED AS ORC
> TBLPROPERTIES ('transactional' = 'true');
>
> hive-cli:
> created a students_copy table and inserted some values in it and did
> "INSERT OVERWRITE TABLE students select * from default.students_copy;"
> I am able to query both tables from hive-cli but not from spark (table
> students is created using spark )
>
> Thanks
>
> On Mon, 14 Jun 2021 at 20:07, Mich Talebzadeh 
> wrote:
>
>> Ok there were issues in the past with the ORC table read through Spark.
>>
>> If the ORC table is created through Spark I believe it will work
>>
>> Do a test. Create the ORC table through Spark first.
>>
>> Then do insert overwrite into that table through Hive cli from your Hive
>> created ORC table and see if you can access data in the new table through
>> Spark.
>>
>> HTH
>>
>>
>> On Mon, 14 Jun 2021 at 15:19, Suryansh Agnihotri <
>> sagnihotri2...@gmail.com> wrote:
>>
>>> Table was created by hive (hive-cli) , format is orc. I am able to get
>>> data from hive-cli (hive return rows).
>>> But spark-sql/spark-shell does not return any rows.
>>>
>>> On Mon, 14 Jun 2021 at 19:26, Mich Talebzadeh 
>>> wrote:
>>>
>>>> How the table was created in the first place, spark or Hive?
>>>>
>>>> Is this table an ORC table and does Spark or Hive return rows?
>>>>
>>>> HTH

Re: Spark ACID compatibility

2021-06-14 Thread Suryansh Agnihotri
No, this also does not work.
Steps I followed:

spark-sql:
CREATE TABLE students (id int, name string, marks int) STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

hive-cli:
Created a students_copy table, inserted some values into it, and ran
"INSERT OVERWRITE TABLE students select * from default.students_copy;"

I am able to query both tables from hive-cli but not from Spark (the
students table was created using Spark).

Thanks

On Mon, 14 Jun 2021 at 20:07, Mich Talebzadeh 
wrote:

> Ok there were issues in the past with the ORC table read through Spark.
>
> If the ORC table is created through Spark I believe it will work
>
> Do a test. Create the ORC table through Spark first.
>
> Then do insert overwrite into that table through Hive cli from your Hive
> created ORC table and see if you can access data in the new table through
> Spark.
>
> HTH
>
>
> On Mon, 14 Jun 2021 at 15:19, Suryansh Agnihotri 
> wrote:
>
>> Table was created by hive (hive-cli) , format is orc. I am able to get
>> data from hive-cli (hive return rows).
>> But spark-sql/spark-shell does not return any rows.
>>
>> On Mon, 14 Jun 2021 at 19:26, Mich Talebzadeh 
>> wrote:
>>
>>> How the table was created in the first place, spark or Hive?
>>>
>>> Is this table an ORC table and does Spark or Hive return rows?
>>>
>>> HTH
>>>
>>>
>>> On Mon, 14 Jun 2021 at 14:33, Suryansh Agnihotri <
>>> sagnihotri2...@gmail.com> wrote:
>>>
>>>> Hi
>>>> Does spark support querying hive tables which are transactional?
>>>>  I am using spark 3.0.2 / hive metastore 3.1.2 and trying to query the
>>>> table but I am not able to see the data from the table , although *show
>>>> tables *does list the table from hive metastore and desc table works
>>>> fine but *select * from table* gives *empty result*.
>>>> Does the later version of spark have the fix or is there another way to
>>>> query?
>>>> Thanks

>>>


Re: Spark ACID compatibility

2021-06-14 Thread Mich Talebzadeh
OK, there were issues in the past with reading ORC tables through Spark.

If the ORC table is created through Spark, I believe it will work.

Do a test. Create the ORC table through Spark first.

Then do an INSERT OVERWRITE into that table through the Hive CLI from your
Hive-created ORC table, and see if you can access the data in the new table
through Spark.

HTH






On Mon, 14 Jun 2021 at 15:19, Suryansh Agnihotri 
wrote:

> Table was created by hive (hive-cli) , format is orc. I am able to get
> data from hive-cli (hive return rows).
> But spark-sql/spark-shell does not return any rows.
>
> On Mon, 14 Jun 2021 at 19:26, Mich Talebzadeh 
> wrote:
>
>> How the table was created in the first place, spark or Hive?
>>
>> Is this table an ORC table and does Spark or Hive return rows?
>>
>> HTH
>>
>>
>> On Mon, 14 Jun 2021 at 14:33, Suryansh Agnihotri <
>> sagnihotri2...@gmail.com> wrote:
>>
>>> Hi
>>> Does spark support querying hive tables which are transactional?
>>>  I am using spark 3.0.2 / hive metastore 3.1.2 and trying to query the
>>> table but I am not able to see the data from the table , although *show
>>> tables *does list the table from hive metastore and desc table works
>>> fine but *select * from table* gives *empty result*.
>>> Does the later version of spark have the fix or is there another way to
>>> query?
>>> Thanks
>>>
>>


Re: Spark ACID compatibility

2021-06-14 Thread Suryansh Agnihotri
The table was created by Hive (hive-cli); the format is ORC. I am able to
get data from hive-cli (Hive returns rows), but spark-sql/spark-shell does
not return any rows.

On Mon, 14 Jun 2021 at 19:26, Mich Talebzadeh 
wrote:

> How the table was created in the first place, spark or Hive?
>
> Is this table an ORC table and does Spark or Hive return rows?
>
> HTH
>
>
> On Mon, 14 Jun 2021 at 14:33, Suryansh Agnihotri 
> wrote:
>
>> Hi
>> Does spark support querying hive tables which are transactional?
>>  I am using spark 3.0.2 / hive metastore 3.1.2 and trying to query the
>> table but I am not able to see the data from the table , although *show
>> tables *does list the table from hive metastore and desc table works
>> fine but *select * from table* gives *empty result*.
>> Does the later version of spark have the fix or is there another way to
>> query?
>> Thanks
>>
>


Re: Spark ACID compatibility

2021-06-14 Thread Mich Talebzadeh
How was the table created in the first place, through Spark or Hive?

Is this table an ORC table, and does Spark or Hive return rows?

HTH




On Mon, 14 Jun 2021 at 14:33, Suryansh Agnihotri 
wrote:

> Hi
> Does spark support querying hive tables which are transactional?
>  I am using spark 3.0.2 / hive metastore 3.1.2 and trying to query the
> table but I am not able to see the data from the table , although *show
> tables *does list the table from hive metastore and desc table works fine
> but *select * from table* gives *empty result*.
> Does the later version of spark have the fix or is there another way to
> query?
> Thanks
>