Hi,
we should also consider how the issue of data/index consistency is
tackled
in AsterixDB [1]. It doesn’t automatically update indexes, but it
ensures
consistency and thus allows the optimizer to choose an index without
changing the result of the query.
The approach might not be the right one for VXQuery, but it would be
good to
take a look :)
Cheers,
Till
[1] http://dl.acm.org/citation.cfm?id=2806428
On 6 Jun 2016, at 20:52, Menaka Madushanka wrote:
Hello,
I'm sorry Preston. Here is the link for the image.
https://drive.google.com/file/d/0B-2mdAzfAj07Z0w4RVZ2SGFfTFk/view?usp=sharing
I came up with this approach thinking that the index should be
updated automatically, without user intervention, whenever any of the
XML files changes. Updating the index automatically is also what I
described in the proposal.
I hadn't seen the new issue that Steven added about it,
https://issues.apache.org/jira/browse/VXQUERY-198.
As Steven mentioned, the update process should be designed so that
only the changed files (updated, deleted or inserted) are updated in
the index.
Is there anything else that we will eventually want in a metadata
file?
I think that, since we are trying to track modified files, a
content-based checksum is the best way to do it. We could check the
last-modified date instead, but that is not a fully reliable method:
it depends on a single factor, which can itself change with the clock
of the user's machine.
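As a minimal sketch of the content-based checksum idea, assuming SHA-256 and a hypothetical helper class named FileChecksums (neither is decided in this thread, and this is not existing VXQuery code), the digest could be computed like this:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not part of VXQuery.
final class FileChecksums {
    private FileChecksums() {}

    /** Returns the SHA-256 digest of the file's bytes as lowercase hex. */
    static String sha256Hex(Path file) throws IOException {
        MessageDigest digest;
        try {
            digest = MessageDigest.getInstance("SHA-256");
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform implementation is required to provide SHA-256.
            throw new IllegalStateException(e);
        }
        // Stream the file so large XML documents are not loaded into memory at once.
        byte[] buffer = new byte[8192];
        try (InputStream in = Files.newInputStream(file)) {
            int read;
            while ((read = in.read(buffer)) != -1) {
                digest.update(buffer, 0, read);
            }
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : digest.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Because the digest depends only on the bytes of the file, changing the file's last-modified time or the machine's clock leaves the value unchanged, while any change to the content produces a different value.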
Besides the checksum value, I think we can store some info about the
relevant index for that file, so that updating the index becomes much
easier. (I have to check whether this is possible.)
When you say run a query, is this an UPDATE query or a SEARCH query? I
think at this point we only want to cause the update action to happen
for an UPDATE query. The overhead of updating the index before
searching could be too much. Let's first get UPDATE working.
I thought this would run as part of a search query (I was not fully
aware of the update index query). So my suggestion was that a search
query would first check for file changes and, if there were any,
update the corresponding index and then search it. As you mentioned,
that would add a huge overhead. So we can use this method to detect
the changed files and update the index in the update query instead.
Thank you very much
Menaka
On 6 June 2016 at 03:02, Steven Jacobs <[email protected]> wrote:
In addition to Preston's comments, we also need to start thinking
about the
Lucene side. Once we know a file needs to be changed in the index,
how does
this change take place? Looking at how things are stored now will
help with
this.
Steven
On Sunday, June 5, 2016, Preston Carman <[email protected]> wrote:
As we consider creating a metadata file for each index, let's
consider what other information could be stored with the index. What
types of functionality do we need for a complete indexing story?
As I understand it, we support creating an index and searching using
that index. Would we want to show the user a list of indexes?
Menaka's e-mail suggests we need a way to update an index. What other
queries/features should we support around indexes?
Indexing Features
* Create index
* Search using index
* Update index???
* List indexes???
* Delete index???
On Sat, Jun 4, 2016 at 10:18 PM, Menaka Madushanka
<[email protected]> wrote:
Hi everyone,
I came up with an implementation plan for the $subject. It will be
able to detect file content changes as well as deletions and
additions.
Methodology:
1. Generate a checksum (MD5/SHA) for each file. These checksum values
will be written to a single properties file in the following format:
path_to_the_file=checksum_string
Is there anything else that we will eventually want in a metadata
file?
2. On the first run, the checksums will be calculated and the
properties file will be created.
Sounds good.
3. When running a query:
The properties file will be read and loaded into memory.
The checksum of each file will be checked against the stored value.
If any modification is detected, the index will be updated and the
new checksum value will be stored.
When checking the checksums, we start from the files themselves: the
path of each file is used to retrieve the stored checksum for that
file from the properties.
So any file insertion or deletion can be detected as well, because we
consider the actual files first.
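The comparison in step 3 could be sketched roughly as follows. This is only an illustration under some assumptions: the class name IndexChangeDetector and its Change enum are hypothetical, the stored checksums are already loaded into a java.util.Properties object, and the caller has already computed a path-to-checksum map for the files currently on disk. Since a deleted file no longer shows up when walking the actual files, the sketch adds a second pass over the stored entries to catch deletions; that second pass is an assumption of the sketch, not something spelled out in the plan.

```java
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical sketch for illustration only; not part of VXQuery.
final class IndexChangeDetector {
    enum Change { INSERTED, UPDATED, DELETED }

    private IndexChangeDetector() {}

    /**
     * Compares stored checksums (path=checksum entries from the properties
     * file) with checksums computed from the files currently on disk.
     * Unchanged files do not appear in the result.
     */
    static Map<String, Change> detect(Properties stored, Map<String, String> current) {
        Map<String, Change> changes = new TreeMap<>();
        // Pass 1: start from the actual files to find insertions and updates.
        for (Map.Entry<String, String> entry : current.entrySet()) {
            String previous = stored.getProperty(entry.getKey());
            if (previous == null) {
                changes.put(entry.getKey(), Change.INSERTED);
            } else if (!previous.equals(entry.getValue())) {
                changes.put(entry.getKey(), Change.UPDATED);
            }
        }
        // Pass 2: stored paths that no longer exist on disk were deleted.
        for (String path : stored.stringPropertyNames()) {
            if (!current.containsKey(path)) {
                changes.put(path, Change.DELETED);
            }
        }
        return changes;
    }
}
```

Only the paths returned here would need to be touched in the index, after which the properties file would be rewritten with the new checksum values.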
When you say run a query, is this an UPDATE query or a SEARCH query?
I think at this point we only want to cause the update action to
happen for an UPDATE query. The overhead of updating the index before
searching could be too much. Let's first get UPDATE working.
To make the process more clear, I have attached the flow diagram
herewith.
I do not see the diagram. Apache will only forward certain types of
attachments. Can you post a link to your diagram?
I'd be very happy to have any feedback on this approach.
Thank you very much
Menaka
--
Menaka Madushanka Jayawardena
Faculty of Engineering,
University of Peradeniya.
LinkedIn
TP:- 071 885 1183/ 071 350 5470