Re: Add datastore for Elasticsearch. Outreachy Week 10 Report

2021-02-15 Thread Maria Podorvanova
Hi John,

Thank you for your answers.

1) The Elasticsearch "_id" field is of type string. I am not sure that simply
copying the contents of "_id" into another field will fix the problem, because
the copied value can still be an arbitrary string (i.e. not necessarily an
integer).

2) Elasticsearch does not support partitioning, so I will leave the single
partition implementation.
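
To illustrate the concern in point 1, here is a minimal sketch, assuming the copied key is compared as a plain string (which, as far as I understand, is how a range over a string-typed field would behave). The class and helper names are hypothetical and String.compareTo only stands in for the ordering of ASCII keys; it is not Elasticsearch code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration: string keys sort lexicographically, not numerically.
final class StringKeyOrder {

    /** True if key lies in [start, end] under String.compareTo ordering. */
    static boolean inRange(String key, String start, String end) {
        return key.compareTo(start) >= 0 && key.compareTo(end) <= 0;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>(Arrays.asList("1", "2", "9", "10", "25"));
        Collections.sort(keys);
        // "10" sorts before "2" because the comparison is character by character.
        System.out.println(keys);
        // "10" falls inside the string range ["1", "2"].
        System.out.println(inRange("10", "1", "2"));
        // "9" falls outside the string range ["1", "25"].
        System.out.println(inRange("9", "1", "25"));
    }
}
```

So a range over copied string keys can still select a consistent set of documents, but only in string order, which differs from numeric order whenever the keys look like numbers.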

Regards,
Maria

On Tue, 16 Feb 2021 at 09:14, John Mora wrote:

> Hi Maria,
>
> Thanks for the update.
>
> 1) I think you can copy the content from _id to a manually created field,
> let's say 'gora_id', using copy_to.
>
>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html
>
> But I have not tried it yet, so I am not sure whether this will work.
>
> Alternatively, you can manually copy the value of the key to a field that
> can be range queried in the put method of the datastore.
>
> 2) In some databases you can split your data into partitions, generally
> by defining ranges for the primary key.
>
> Kudu is an example of this:
> https://kudu.apache.org/docs/schema_design.html#range-partitioning
>
> In this case, the getPartitions method should split a query using the
> existing partition ranges:
> Kudu example:
>
> https://github.com/apache/gora/blob/master/gora-kudu/src/main/java/org/apache/gora/kudu/store/KuduStore.java#L383
>
> If the database does not support partitioning, this method only returns a
> single partition (the whole table/collection).
> This is probably the implementation that you saw.
>
> I think Elasticsearch does not support partitioning; in that case your
> implementation is fine, but I am not an expert in Elasticsearch.
>
> Best,
> John
>
> On Sat, 13 Feb 2021 at 0:15, Maria Podorvanova (<
> podorvanova.ma...@gmail.com>) wrote:
>
>> Hi,
>>
>> Report #10
>> Week 10: January, 7 - February, 13
>> Activities:
>> - Implemented newQuery method
>> - Implemented deleteByQuery method
>> - Used an Enum instead of literal strings for the Authentication Type
>> parameter
>> - Used parameterized logging instead of string concatenation
>> - Implemented execute method
>> - Implemented getPartitions method
>> - The following tests are passing now:
>>
>>    1. testTruncateSchema
>>    2. testDeleteSchema
>>    3. testQueryWebPageQueryEmptyResults
>>    4. testResultSize
>>    5. testResultSizeStartKey
>>    6. testResultSizeEndKey
>>    7. testResultSizeWithLimit
>>    8. testResultSizeStartKeyWithLimit
>>    9. testResultSizeEndKeyWithLimit
>>    10. testResultSizeKeyRangeWithLimit
>>
>> - Filled out and sent Outreachy internship feedback to Apache
>>
>> Here is the link to my code:
>> https://github.com/apache/gora/compare/master...podorvanova:gora-664.
>> Relevant commits are from February 10.
>>
>> Questions:
>>
>>    1. This week I worked on implementing the query functionality. While
>>    testing, I found that the Elasticsearch "_id" field does not support
>>    range queries, which are required for the deleteByQuery method, so I am
>>    a little confused about what I should do in this case.
>>    2. I roughly understand that the getPartitions method is needed to
>>    implement Hadoop support. I looked through other modules and found that
>>    the method is implemented the same way everywhere, so I did the same for
>>    now. Could you tell me more about this method or maybe provide some
>>    resources?
>>
>>
>> Regards,
>> Maria
>>
>


Re: Add datastore for Elasticsearch. Outreachy Week 10 Report

2021-02-15 Thread John Mora
Hi Maria,

Thanks for the update.

1) I think you can copy the content from _id to a manually created field,
let's say 'gora_id', using copy_to.

https://www.elastic.co/guide/en/elasticsearch/reference/current/copy-to.html

But I have not tried it yet, so I am not sure whether this will work.

Alternatively, you can manually copy the value of the key to a field that
can be range queried in the put method of the datastore.
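
The second option (copying the key into a regular field inside put) could look roughly like the sketch below. This is not code from the Gora module: the class and the helpers withKeyField and rangeQuery are hypothetical, 'gora_id' is just the field name floated above, and it assumes that field is mapped with a type that accepts range queries (for example keyword). The JSON it produces follows the shape of an Elasticsearch Query DSL range query.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: store the key in a normal field and range-query that field.
final class GoraIdSketch {

    // Field name suggested in the thread; an assumption, not an existing mapping.
    static final String KEY_FIELD = "gora_id";

    /** Returns a copy of the document source with the key added as a regular field. */
    static Map<String, Object> withKeyField(String key, Map<String, Object> source) {
        Map<String, Object> doc = new LinkedHashMap<>(source);
        doc.put(KEY_FIELD, key);
        return doc;
    }

    /**
     * Builds a range query body on the key field. A null bound is omitted.
     * Keys are assumed to contain no quotes or backslashes (no JSON escaping here).
     */
    static String rangeQuery(String startKey, String endKey) {
        StringBuilder bounds = new StringBuilder();
        if (startKey != null) {
            bounds.append("\"gte\":\"").append(startKey).append('"');
        }
        if (endKey != null) {
            if (bounds.length() > 0) {
                bounds.append(',');
            }
            bounds.append("\"lte\":\"").append(endKey).append('"');
        }
        return "{\"query\":{\"range\":{\"" + KEY_FIELD + "\":{" + bounds + "}}}}";
    }

    public static void main(String[] args) {
        Map<String, Object> source = new LinkedHashMap<>();
        source.put("url", "http://example.org");
        System.out.println(withKeyField("com.example/page", source));
        System.out.println(rangeQuery("a", "m"));
    }
}
```

In a real datastore the map would be handed to the client's index request in put, and the query body to a search or delete-by-query request; those calls are left out here on purpose.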

2) In some databases you can split your data into partitions, generally
by defining ranges for the primary key.

Kudu is an example of this:
https://kudu.apache.org/docs/schema_design.html#range-partitioning

In this case, the getPartitions method should split a query using the
existing partition ranges:
Kudu example:
https://github.com/apache/gora/blob/master/gora-kudu/src/main/java/org/apache/gora/kudu/store/KuduStore.java#L383

If the database does not support partitioning, this method only returns a
single partition (the whole table/collection).
This is probably the implementation that you saw.

I think Elasticsearch does not support partitioning; in that case your
implementation is fine, but I am not an expert in Elasticsearch.
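
To make the two getPartitions behaviours concrete, here is a minimal, self-contained sketch. It does not use Gora's own query or partition classes: RangeQuery, singlePartition and splitByRanges are hypothetical stand-ins, keys are plain strings, and ranges are assumed to be half-open [startKey, endKey) with null meaning unbounded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the two getPartitions strategies discussed above.
final class PartitionSketch {

    /** Stand-in for a key-range query: [startKey, endKey), null means unbounded. */
    static final class RangeQuery {
        final String startKey;
        final String endKey;

        RangeQuery(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }

        @Override
        public String toString() {
            return "[" + startKey + ", " + endKey + ")";
        }
    }

    /** No partitioning support: the whole query is the only partition. */
    static List<RangeQuery> singlePartition(RangeQuery query) {
        List<RangeQuery> parts = new ArrayList<>();
        parts.add(query);
        return parts;
    }

    /** Range partitioning: intersect the query with each existing partition range. */
    static List<RangeQuery> splitByRanges(RangeQuery query, List<RangeQuery> ranges) {
        List<RangeQuery> parts = new ArrayList<>();
        for (RangeQuery range : ranges) {
            String start = later(query.startKey, range.startKey);
            String end = earlier(query.endKey, range.endKey);
            // Keep only non-empty intersections.
            if (start == null || end == null || start.compareTo(end) < 0) {
                parts.add(new RangeQuery(start, end));
            }
        }
        return parts;
    }

    // Larger of two lower bounds; null is treated as minus infinity.
    private static String later(String a, String b) {
        if (a == null) return b;
        if (b == null) return a;
        return a.compareTo(b) >= 0 ? a : b;
    }

    // Smaller of two upper bounds; null is treated as plus infinity.
    private static String earlier(String a, String b) {
        if (a == null) return b;
        if (b == null) return a;
        return a.compareTo(b) <= 0 ? a : b;
    }

    public static void main(String[] args) {
        List<RangeQuery> ranges = Arrays.asList(
                new RangeQuery(null, "g"), new RangeQuery("g", "p"), new RangeQuery("p", null));
        RangeQuery query = new RangeQuery("c", "k");
        System.out.println(singlePartition(query));
        System.out.println(splitByRanges(query, ranges));
    }
}
```

With the sample ranges, the query [c, k) overlaps only the first two partitions, so the split yields two sub-queries, each of which could be processed by a separate Hadoop task. The single-partition variant simply hands the whole query to one task.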

Best,
John

On Sat, 13 Feb 2021 at 0:15, Maria Podorvanova (<
podorvanova.ma...@gmail.com>) wrote:

> Hi,
>
> Report #10
> Week 10: January, 7 - February, 13
> Activities:
> - Implemented newQuery method
> - Implemented deleteByQuery method
> - Used an Enum instead of literal strings for the Authentication Type
> parameter
> - Used parameterized logging instead of string concatenation
> - Implemented execute method
> - Implemented getPartitions method
> - The following tests are passing now:
>
>    1. testTruncateSchema
>    2. testDeleteSchema
>    3. testQueryWebPageQueryEmptyResults
>    4. testResultSize
>    5. testResultSizeStartKey
>    6. testResultSizeEndKey
>    7. testResultSizeWithLimit
>    8. testResultSizeStartKeyWithLimit
>    9. testResultSizeEndKeyWithLimit
>    10. testResultSizeKeyRangeWithLimit
>
> - Filled out and sent Outreachy internship feedback to Apache
>
> Here is the link to my code:
> https://github.com/apache/gora/compare/master...podorvanova:gora-664.
> Relevant commits are from February 10.
>
> Questions:
>
>    1. This week I worked on implementing the query functionality. While
>    testing, I found that the Elasticsearch "_id" field does not support
>    range queries, which are required for the deleteByQuery method, so I am
>    a little confused about what I should do in this case.
>    2. I roughly understand that the getPartitions method is needed to
>    implement Hadoop support. I looked through other modules and found that
>    the method is implemented the same way everywhere, so I did the same for
>    now. Could you tell me more about this method or maybe provide some
>    resources?
>
>
> Regards,
> Maria
>


Add datastore for Elasticsearch. Outreachy Week 10 Report

2021-02-12 Thread Maria Podorvanova
Hi,

Report #10
Week 10: January, 7 - February, 13
Activities:
- Implemented newQuery method
- Implemented deleteByQuery method
- Used an Enum instead of literal strings for the Authentication Type
parameter
- Used parameterized logging instead of string concatenation
- Implemented execute method
- Implemented getPartitions method
- The following tests are passing now:

   1. testTruncateSchema
   2. testDeleteSchema
   3. testQueryWebPageQueryEmptyResults
   4. testResultSize
   5. testResultSizeStartKey
   6. testResultSizeEndKey
   7. testResultSizeWithLimit
   8. testResultSizeStartKeyWithLimit
   9. testResultSizeEndKeyWithLimit
   10. testResultSizeKeyRangeWithLimit

- Filled out and sent Outreachy internship feedback to Apache

Here is the link to my code:
https://github.com/apache/gora/compare/master...podorvanova:gora-664.
Relevant commits are from February 10.

Questions:

   1. This week I worked on implementing the query functionality. While
   testing, I found that the Elasticsearch "_id" field does not support range
   queries, which are required for the deleteByQuery method, so I am a little
   confused about what I should do in this case.
   2. I roughly understand that the getPartitions method is needed to
   implement Hadoop support. I looked through other modules and found that the
   method is implemented the same way everywhere, so I did the same for now.
   Could you tell me more about this method or maybe provide some resources?


Regards,
Maria