Re: Kudu datastore reports

John Mora Sat, 13 Jul 2019 17:32:22 -0700

Hi all

I updated my report on the wiki [1] and pushed my latest commits to my
branch [2]. Please take a look if you have time.


This week I will be working on the getPartitions and deleteByQuery methods
and running the remaining tests in the DataStoreTestBase class.

Please let me know if you have suggestions.

[1]
https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
[2] https://github.com/jhnmora000/gora/tree/GORA-485

Best,
John.

On Wed, 10 Jul 2019 at 16:17, John Mora (<[email protected]>)
wrote:

> Hi Alfonso,
>
> Thanks so much for your time and support for this project. I will work on
> your comments. Responses inline :)
>
>
> On Tue, 9 Jul 2019 at 16:38, Alfonso Nishikawa (<
> [email protected]>) wrote:
>
>> Hi, John.
>>
>> Sorry for the delay, I am changing jobs and I have been very busy :( I
>> will try to answer your questions :)
>>
>> *> In the Employee example there is a field called 'dateOfBirth'. I tried
>> to map that field to Kudu's UNIXTIME_MICROS datatype (I intuitively
>> assumed it is a date). However, on the Java side the Employee field is
>> a Long value, while the Kudu datatype is a timestamp. So I was wondering
>> whether I should force the use of the UNIXTIME_MICROS datatype for this
>> field or just use a LONG datatype in Kudu.*
>>
>> Avro 1.8 introduced "Logical Types", so there is now a "date" type
>> with an underlying "int" [1]. This is the first time I have read about
>> them, because they were not there until the last Avro version upgrade. I
>> would suggest ignoring "dates" and mapping dateOfBirth as a long, since in
>> any case (in Avro) the value is the Unix epoch. After this first approach,
>> a design improvement would be great, though :)
>>
>> - Would it be good to have a "timestamp" type in the mapping, so that
>> KuduStore converts between the Entity long field <-> Kudu timestamp storage?
>> - Is there any other approach?
>>
>
> I think the Entity long field <-> Kudu timestamp conversion is the best
> alternative right now, because it adds more compatible datatypes to the
> mapping parameters that users can choose from, and in my opinion the
> conversion should not be difficult to implement. Support for Avro's new
> date type could come in a later version, since it needs further analysis
> in the other datastores too. I will work on that.
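
The long <-> timestamp conversion discussed above could look roughly like the sketch below. It assumes the entity's long field holds milliseconds since the Unix epoch (the thread only says "unix epoch", so the unit is an assumption), and that the Kudu column stores microseconds, as the name UNIXTIME_MICROS suggests. The class and method names are hypothetical and are not part of Gora or the Kudu client.

```java
// Sketch only: converts between an entity long (assumed epoch milliseconds)
// and the microsecond value a UNIXTIME_MICROS column would hold.
final class TimestampMappingSketch {

  private static final long MICROS_PER_MILLI = 1000L;

  // Entity -> storage. multiplyExact throws ArithmeticException on overflow
  // instead of silently wrapping around.
  static long millisToMicros(long epochMillis) {
    return Math.multiplyExact(epochMillis, MICROS_PER_MILLI);
  }

  // Storage -> entity. floorDiv rounds toward negative infinity, so values
  // before 1970 (negative epochs) are truncated in the same direction as
  // positive ones.
  static long microsToMillis(long epochMicros) {
    return Math.floorDiv(epochMicros, MICROS_PER_MILLI);
  }

  public static void main(String[] args) {
    long dateOfBirth = 493_689_600_000L; // arbitrary sample value
    long stored = millisToMicros(dateOfBirth);
    System.out.println("stored micros: " + stored);
    System.out.println("round trip ok: " + (microsToMillis(stored) == dateOfBirth));
  }
}
```

Sub-millisecond precision is lost on the way back to the entity, which is harmless for a field such as dateOfBirth but worth keeping in mind for other columns.
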
>
>
>>
>>
>> *> What is Gora's policy regarding flush()?*
>> *> KuduClient has multiple flushing modes
>> <https://kudu.apache.org/apidocs/org/apache/kudu/client/SessionConfiguration.FlushMode.html> and
>> can also set a time interval
>> <https://kudu.apache.org/releases/1.2.0/apidocs/org/apache/kudu/client/KuduSession.html#setFlushInterval-int->
>> for automatic flush.*
>> *> Should these behaviors be configurable through the gora.properties
>> file, or should I just use the default configuration?*
>>
>> What we do in HBase is configure an autoflush option in gora.properties
>> [2], which is used when the Table is instantiated, but at the same time
>> we implement the flush() method to force the flush [3]. I would suggest
>> following that example, but adding Kudu's flushing options. What flushing
>> mode (and time interval, if it applies) do you suggest?
>>
>
> Well, IMHO the default flush mode (auto flush sync) will do the job for
> most use cases. But I will add a configuration option in gora.properties
> for selecting the other modes and specifying an autoflush interval if the
> user needs it.
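
One way to read such a setting is sketched below, under stated assumptions. The two property keys are hypothetical (the thread does not name them), and the local FlushMode enum only mirrors the constant names of the Kudu client's SessionConfiguration.FlushMode so that the example compiles without the Kudu dependency. The fallback to AUTO_FLUSH_SYNC follows the default mentioned above.

```java
import java.util.Locale;
import java.util.Properties;

// Sketch only: reads a flush mode and an optional flush interval from
// gora.properties-style settings.
final class KuduFlushConfigSketch {

  // Stand-in for org.apache.kudu.client.SessionConfiguration.FlushMode.
  enum FlushMode { AUTO_FLUSH_SYNC, AUTO_FLUSH_BACKGROUND, MANUAL_FLUSH }

  // Hypothetical property keys, chosen for illustration.
  static final String FLUSH_MODE_KEY = "gora.datastore.kudu.flushMode";
  static final String FLUSH_INTERVAL_KEY = "gora.datastore.kudu.flushInterval";

  // Returns the configured mode, or AUTO_FLUSH_SYNC when the key is absent
  // or blank. An unknown name makes valueOf throw IllegalArgumentException.
  static FlushMode readFlushMode(Properties props) {
    String raw = props.getProperty(FLUSH_MODE_KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return FlushMode.AUTO_FLUSH_SYNC;
    }
    return FlushMode.valueOf(raw.trim().toUpperCase(Locale.ROOT));
  }

  // Returns the configured interval in milliseconds, or the caller's
  // fallback when the key is absent or blank.
  static int readFlushIntervalMillis(Properties props, int fallbackMillis) {
    String raw = props.getProperty(FLUSH_INTERVAL_KEY);
    if (raw == null || raw.trim().isEmpty()) {
      return fallbackMillis;
    }
    int value = Integer.parseInt(raw.trim());
    if (value <= 0) {
      throw new IllegalArgumentException("Flush interval must be positive: " + value);
    }
    return value;
  }

  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty(FLUSH_MODE_KEY, "manual_flush");
    System.out.println(readFlushMode(props));
    System.out.println(readFlushIntervalMillis(props, 1000));
  }
}
```

In a real store the parsed values would be handed to the Kudu session when it is created, while flush() in the datastore would still force a flush on demand, as in the HBase example.
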
>
>
>>
>> *> Also, while reviewing the datastore interface I noticed the method
>> 'getPartitions(Query<K, T> query)'. What is the expected behavior of this
>> method? Should I use the partition definition in the XML mapping file for
>> this?*
>>
>> The method getPartitions(Query) is related to Hadoop. Apache Gora
>> integrates with Hadoop by implementing a custom Map and Reduce that allow
>> reading and writing Entities directly.
>> You can take a look at HBase's implementation [4], which relies on
>> o.a.h.hbase.mapreduce.TableInputFormatBase
>> [5] to compute the splits (start key to end key) together with the
>> location of each split, and creates a collection of partitions [6].
>>
>> So, if Kudu allows computation to run on local Kudu splits, then this
>> method does the preparation needed to "send the computation to where the
>> data is locally".
>>
>> In any case, you can see that:
>>
>>    - MongoDB store implementation does not implement splitting [7]
>>    - Cassandra store implementation does not implement splitting [8]
>>    - Aerospike store implementation does not implement splitting [9]
>>    - Accumulo store implementation *does* implement splitting [10]
>>
>> If Kudu has a method to get the different splits for a table and their
>> locations, then you will be able to implement the full feature.
>>
>> This is Hadoop related and it is not trivial. I haven't elaborated much,
>> so if you find you need more information let me know :)
>>
>>
>>
> I will check whether Kudu has these features in order to implement this
> method. If not, I will use the default implementation found in other
> backends.
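
The core of the splitting idea described above (cut the query's key interval at the storage boundaries, one partition per piece) can be sketched independently of any backend. In the sketch below the keys are plain longs and the split points are passed in as an array. Both are simplifying assumptions, since a real implementation would obtain the boundaries and their locations from the datastore. The class, its Range type and the method name are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: splits a query key interval [startKey, endKey) at the given
// boundaries, yielding one sub-range per partition.
final class KeyRangePartitionSketch {

  // Half-open key interval [start, end).
  static final class Range {
    final long start;
    final long end;

    Range(long start, long end) {
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + ", " + end + ")";
    }
  }

  // splitPoints must be sorted ascending. Points outside the query interval
  // are ignored, so the result covers the interval exactly once.
  static List<Range> partition(long startKey, long endKey, long[] splitPoints) {
    List<Range> ranges = new ArrayList<>();
    long cursor = startKey;
    for (long split : splitPoints) {
      if (split <= cursor) {
        continue; // boundary lies before the part still to be covered
      }
      if (split >= endKey) {
        break; // boundary lies at or past the end of the query interval
      }
      ranges.add(new Range(cursor, split));
      cursor = split;
    }
    if (cursor < endKey) {
      ranges.add(new Range(cursor, endKey)); // remaining tail
    }
    return ranges;
  }

  public static void main(String[] args) {
    // Query keys 10..50 over boundaries at 0, 20, 40 and 60.
    System.out.println(partition(10, 50, new long[] {0, 20, 40, 60}));
  }
}
```

With no split points the whole query comes back as a single range, which corresponds to the single-partition behavior of the stores that do not implement splitting.
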
>
>
>> About queries, what I can tell you is that HBase only implements "Start
>> key" + "End key" because it has only 2 operations, "get" and "scan", and
>> the querying is for the "scan" operation, where you want an interval (or
>> all) of the rows. Does Kudu have more querying functionality?
>>
>>
> Yes, Kudu implements a Scanner for querying data, along with conditional
> predicates for filtering. I am using those classes.
>
>
>> On another topic, I am trying to install Kudu standalone (all in 1
>> node). Do you use a Cloudera installation or do you have a standalone
>> installation? How do you do it? I found some instructions, but they talk
>> about compiling Kudu [11]. I was looking for something like HBase, where
>> you just unzip and execute "hbase start".
>>
>>
> I am using an embedded mini-cluster, which comes with compiled binaries
> and can be used with Maven [1] for testing my code. Once my code is mature
> enough, I think I will test the datastore with a Docker container [2]. I
> could not find an unzip+execute bundle either, and I am kind of a noob at
> compiling it myself.
>
> [1]
> https://kudu.apache.org/docs/developing.html#_jvm_based_integration_testing
> [2] https://hub.docker.com/r/usuresearch/apache-kudu/
>
>
>> Good job and thank you!! :)
>>
>> Regards,
>>
>> Alfonso Nishikawa
>>
>>
>> [1] - https://avro.apache.org/docs/1.8.0/spec.html#Logical+Types
>> [2] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L175
>> [3] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L458
>> [4] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L472
>> [5] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L479
>> [6] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L517
>> [7] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-mongodb/src/main/java/org/apache/gora/mongodb/store/MongoStore.java#L533
>> [8] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-cassandra/src/main/java/org/apache/gora/cassandra/store/CassandraStore.java#L292
>> [9] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-aerospike/src/main/java/org/apache/gora/aerospike/store/AerospikeStore.java#L369
>> [10] -
>> https://github.com/apache/gora/blob/apache-gora-0.9/gora-accumulo/src/main/java/org/apache/gora/accumulo/store/AccumuloStore.java#L902
>> [11] - https://kudu.apache.org/docs/installation.html
>>
>>
>> On Mon, 8 Jul 2019 at 3:42, John Mora (<[email protected]>)
>> wrote:
>>
>>> Hi all.
>>>
>>> As every week, I updated my report on the wiki [1] and pushed my latest
>>> commits to my branch [2]. Please take a look if you have time.
>>>
>>> This week I will continue working on the Queries implementation;
>>> please reach out to me if you have any suggestions.
>>>
>>> Also, while reviewing the datastore interface I noticed the method
>>> 'getPartitions(Query<K, T> query)'. What is the expected behavior of this
>>> method? Should I use the partition definition in the XML mapping file for
>>> this?
>>>
>>> Cheers,
>>> John.
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>>
>>>
>>> On Sun, 30 Jun 2019 at 16:56, John Mora (<[email protected]>)
>>> wrote:
>>>
>>>> Hi all.
>>>>
>>>> I received my first evaluation from the Google Summer of Code program
>>>> with a positive result. Thanks so much for your support and confidence
>>>> in the project and in me.
>>>>
>>>> I updated this week's report on the wiki [1] and pushed my latest
>>>> commits to my branch [2].
>>>>
>>>> This week I will be reviewing the serialization/deserialization
>>>> process in order to identify optimizations specific to Kudu, because I
>>>> used generic methods from other backends that could probably be better
>>>> tuned for Kudu. I will also start working on the Queries implementation.
>>>>
>>>> BTW, I added a question to the wiki about Date types. Please take a
>>>> look if you have time.
>>>>
>>>> [1]
>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>>>
>>>> Cheers,
>>>> John
>>>>
>>>> On Thu, 27 Jun 2019 at 21:02, John Mora (<[email protected]>)
>>>> wrote:
>>>>
>>>>> Hi Carlos.
>>>>>
>>>>> Thanks for the reminder. I submitted the form yesterday. :D
>>>>>
>>>>> Best,
>>>>> John.
>>>>>
>>>>> On Thu, 27 Jun 2019 at 17:34, carlos muñoz (<[email protected]>)
>>>>> wrote:
>>>>>
>>>>>> Hi John
>>>>>>
>>>>>> The first Google Summer of Code evaluation is due on June 28th.
>>>>>> Please make sure you submit your Mentors' evaluation on time.
>>>>>>
>>>>>> Regards,
>>>>>> Carlos
>>>>>>
>>>>>> On Sun, 23 Jun 2019 at 18:29, John Mora (<[email protected]>)
>>>>>> wrote:
>>>>>>
>>>>>>> Hi all.
>>>>>>>
>>>>>>> FYI, I updated this week's report on the wiki [1] and pushed my
>>>>>>> latest commits to my branch [2].
>>>>>>>
>>>>>>> As I mentioned in the reports, I would like to know how datastores
>>>>>>> deal with flush(): should it always be executed manually?
>>>>>>>
>>>>>>> Finally, this week I will be implementing object
>>>>>>> serialization/deserialization in the put, get, delete and exists
>>>>>>> methods. Do you have any suggestions on how to proceed with this task?
>>>>>>>
>>>>>>> Footnote: Thanks for the feedback Carlos, I fixed the problem.
>>>>>>>
>>>>>>> [1]
>>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>>>>>>
>>>>>>> Cheers,
>>>>>>> John
>>>>>>>
>>>>>>>
>>>>>>> On Mon, 17 Jun 2019 at 22:58, carlos muñoz (<
>>>>>>> [email protected]>) wrote:
>>>>>>>
>>>>>>>> Hi John
>>>>>>>>
>>>>>>>> Your last changes look good to me. Keep it up. However, I noticed
>>>>>>>> that you have created an Enumeration for datatypes [1] which is very
>>>>>>>> similar to the kudu-client's [2]. You should probably replace [1]
>>>>>>>> with [2] in order to avoid code duplication.
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/mapping/Column.java#L76
>>>>>>>> [2] https://kudu.apache.org/apidocs/org/apache/kudu/Type.html
>>>>>>>>
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Carlos
>>>>>>>>
>>>>>>>> On Sat, 15 Jun 2019 at 12:01, John Mora (<[email protected]>)
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi all.
>>>>>>>>>
>>>>>>>>> I updated this week's report on the wiki [1]. I noticed that my
>>>>>>>>> code is lacking some Javadoc documentation, so I think I will be
>>>>>>>>> working on that this week. I would also like to enable and check
>>>>>>>>> the schema management tests (createSchema, existsSchema, etc.).
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> John.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, 11 Jun 2019 at 0:11, John Mora (<[email protected]>)
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Alfonso.
>>>>>>>>>>
>>>>>>>>>> Thanks so much for your feedback. I am working on your comments.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> John
>>>>>>>>>>
>>>>>>>>>> On Mon, 10 Jun 2019 at 16:11, Alfonso Nishikawa (<
>>>>>>>>>> [email protected]>) wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, John.
>>>>>>>>>>>
>>>>>>>>>>> Regarding your questions at the report [1]:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>    - How to represent partitioning configurations on the
>>>>>>>>>>>    mapping file.
>>>>>>>>>>>
>>>>>>>>>>> This was discussed in other emails, wasn't it? :)
>>>>>>>>>>>
>>>>>>>>>>>    - KuduTestHarness requires the Maven plugin os-maven-plugin,
>>>>>>>>>>>    which needs Maven 3.1.1+, is it a problem for Apache Gora?
>>>>>>>>>>>
>>>>>>>>>>> I believe it is not a problem. My Ubuntu comes with 3.6.0, well
>>>>>>>>>>> past 3.1.1, and I assume everyone uses a fairly recent version of
>>>>>>>>>>> Maven 3 :)
>>>>>>>>>>>
>>>>>>>>>>> [1] -
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>>
>>>>>>>>>>> Alfonso Nishikawa
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, 10 Jun 2019 at 21:07, Alfonso Nishikawa (<
>>>>>>>>>>> [email protected]>) wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, John.
>>>>>>>>>>>>
>>>>>>>>>>>> Thank you!
>>>>>>>>>>>> Things I have seen:
>>>>>>>>>>>>
>>>>>>>>>>>> - The version of a Maven dependency [1] should go in the
>>>>>>>>>>>> Dependency Management section of the root pom [2], and should not
>>>>>>>>>>>> be set in the module pom. Same for [3] and the ones after it.
>>>>>>>>>>>> - Set the scope of the test dependencies to test, at [4] and the
>>>>>>>>>>>> ones after it.
>>>>>>>>>>>> - Set the indentation to 2 spaces for the pom [5].
>>>>>>>>>>>> - Missing "t" in "localhost" at [6].
>>>>>>>>>>>> - Port 13 for Kudu? That is the "Daytime Protocol" (RFC 867) and
>>>>>>>>>>>> you will need root permission to run on it. The default port for
>>>>>>>>>>>> Kudu is 7051, isn't it?
>>>>>>>>>>>> - I would ask you to add the same functionality to load the
>>>>>>>>>>>> mapping from configuration as in HBase's store [7] to your
>>>>>>>>>>>> KuduStore [8]. This will have implications for your readMapping at
>>>>>>>>>>>> [9], so take a look at the one for HBase at [10].
>>>>>>>>>>>> - I know it is done in other backends, but avoid RuntimeExceptions
>>>>>>>>>>>> (at least in Java, since we have the checked ones) like in [11].
>>>>>>>>>>>> You can wrap them in GoraException. An example is [12].
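
The wrapping pattern from the last point can be sketched as follows. To keep the example self-contained, StoreException is a hypothetical checked stand-in for GoraException, and parsePort is an illustrative helper rather than code from the branch.

```java
// Sketch only: surface a low-level RuntimeException as a checked exception,
// keeping the original as the cause.
final class ExceptionWrappingSketch {

  // Hypothetical stand-in for org.apache.gora.util.GoraException.
  static final class StoreException extends Exception {
    StoreException(String message, Throwable cause) {
      super(message, cause);
    }
  }

  // Parses a port from a configuration value. Any RuntimeException raised
  // while parsing (for example NumberFormatException) is wrapped, so callers
  // are forced by the compiler to handle the failure.
  static int parsePort(String raw) throws StoreException {
    try {
      return Integer.parseInt(raw.trim());
    } catch (RuntimeException e) {
      throw new StoreException("Invalid port value: " + raw, e);
    }
  }

  public static void main(String[] args) {
    try {
      System.out.println(parsePort("7051"));
      parsePort("localhos:13");
    } catch (StoreException e) {
      System.out.println(e.getMessage() + " (cause: " + e.getCause().getClass().getSimpleName() + ")");
    }
  }
}
```

Keeping the original exception as the cause preserves the stack trace for debugging while the method signature stays explicit about what can go wrong.
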
>>>>>>>>>>>>
>>>>>>>>>>>> And nothing more :)
>>>>>>>>>>>> Keep going, good job.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [1] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml#L98
>>>>>>>>>>>> [2] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/pom.xml#L890
>>>>>>>>>>>> [3] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml#L121
>>>>>>>>>>>> [4] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml#L180
>>>>>>>>>>>> [5] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/pom.xml
>>>>>>>>>>>> [6] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/test/resources/gora.properties#L18
>>>>>>>>>>>> [7] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L92
>>>>>>>>>>>> [8] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/store/KuduStore.java#L53
>>>>>>>>>>>> [9] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/mapping/KuduMappingBuilder.java#L81
>>>>>>>>>>>> [10] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L822
>>>>>>>>>>>> [11] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/GORA-485/gora-kudu/src/main/java/org/apache/gora/kudu/mapping/KuduMappingBuilder.java#L141
>>>>>>>>>>>> [12] -
>>>>>>>>>>>> https://github.com/jhnmora000/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L268
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Alfonso Nishikawa
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, 8 Jun 2019 at 20:26, John Mora (<
>>>>>>>>>>>> [email protected]>) wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have just updated my weekly reports on Cwiki [1]. This coming
>>>>>>>>>>>>> week I think I should focus on the create schema operation and
>>>>>>>>>>>>> on solving the issue of the partitioning configurations in the
>>>>>>>>>>>>> mapping file.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please let me know if you have suggestions. My last commits
>>>>>>>>>>>>> are available here [2].
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1]
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/GORA/GORA-485+Apache+Kudu+datastore+for+Gora+Reports
>>>>>>>>>>>>> [2] https://github.com/jhnmora000/gora/tree/GORA-485
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> John
>>>>>>>>>>>>>
>>>>>>>>>>>>>
