On 14 Dec 2015, at 18:55, Murtadha Hubail wrote:

I think the backward compatibility discussion goes beyond metadata indexes; a complete plan that considers everything in storage should be developed to support upgrading and patching. Just as an example: when we did the repackaging from edu.uci to org.apache, all existing edu.uci instances wouldn't work on the new binaries, due to Java serialization of edu.uci classes.

Good point. Do you know if we fixed that or did we just leave it as-is?

Having said that, I would go with the right long term solution for metadata indexes which would’ve been a result of the backward compatibility plan if we had one.

I tend to agree here. I think that we'll need a backwards-compatibility story, even if we choose to be schema-less for all metadata. 1) Even if the metadata is all flexible, we'll be able to read the old metadata, but we'll need to keep code around to read all versions of the metadata. 2) If we need to change the file format for the data, we'll also need a way to detect that (and that would probably affect the metadata as well).

I think that it might be a good start to add version identifiers to persisted data structures, so that we'd at least be able to distinguish different versions (and potentially have the ability to provide some migration, if needed).
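To make the idea concrete, here is a toy sketch (not actual AsterixDB code; all names are invented) of what a version identifier on a persisted record could look like: the version is written first, and the reader dispatches on it, which leaves room for per-version migration later.

```java
import java.io.*;

// Toy sketch: prefix each persisted record with a version id so that
// future binaries can recognize (and possibly migrate) old on-disk formats.
class VersionedRecord {
    static final int CURRENT_VERSION = 2;

    static byte[] write(String payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(CURRENT_VERSION); // version identifier goes first
        out.writeUTF(payload);
        return bos.toByteArray();
    }

    static String read(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        int version = in.readInt();
        switch (version) {
            case 1: // old format: a migration step could go here
            case 2:
                return in.readUTF();
            default:
                throw new IOException("Unknown metadata version: " + version);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] stored = write("Dataset{name=Foo}");
        System.out.println(read(stored)); // prints Dataset{name=Foo}
    }
}
```

The key point is only that the version is readable before any payload deserialization is attempted, so an unknown format fails loudly instead of deserializing garbage.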

Thoughts?

Cheers,
Till

On Dec 14, 2015, at 6:19 PM, Ildar Absalyamov <[email protected]> wrote:

As for the general topic of backwards compatibility, I think going "fully open" might be the best long-term solution. Once in a while the topic of changing metadata keeps reappearing, and there is no guarantee it will not strike back again. Opening up metadata would release us from the burden of producing migration tools and shipping them with each new version of the binaries that revises the catalog. The performance (mainly storage) impact of that solution would be tolerable, especially considering how little data is usually stored in metadata. Moreover, as big proponents of semi-structured data, it makes perfect sense for us to eat our own dog food here.

On Dec 14, 2015, at 18:04, Ildar Absalyamov <[email protected]> wrote:

I guess the main argument for 2 would be eliminating broken metadata records prior to the backwards-compatibility cutoff. The last thing we want is to be stuck with the wrong implementation for compatibility reasons. Once the functionality needed for 3 is there, we can reintroduce those indexes without building a sophisticated migration subsystem.

On Dec 14, 2015, at 17:55, Mike Carey <[email protected]> wrote:

SO - it seems like 3 is the right long-term answer, but not doable now? (If it were doable now, it would obviously be the ideal choice of the three.)
What would be the argument for doing 2 as opposed to 1 for now?
As for the question of backwards compatibility, I actually didn't sense a consensus yet. I would tentatively lean towards "right" over "backwards compatible" for this change.
What are others' thoughts on that?
(Soon we won't have that luxury, but right now maybe we do?)

On 12/14/15 3:43 PM, Steven Jacobs wrote:
We just had a UCR discussion on this topic. The issue is really with the third "index" here. The code now is using one "index" to go in two directions:
1) To find datatypes that use datatype A
2) To find datatypes that are used by datatype A.
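For illustration only (this is not the actual Metadata code, and all names are invented), the two lookup directions naturally correspond to two separate indexes, here modeled as a forward and a reverse map kept in sync:

```java
import java.util.*;

// Toy sketch: keep the two lookup directions as two separate indexes
// instead of one overloaded "index" serving both.
class DatatypeUsage {
    // forward: datatype -> datatypes it uses (its member/nested types)
    private final Map<String, Set<String>> uses = new HashMap<>();
    // reverse: datatype -> datatypes that use it
    private final Map<String, Set<String>> usedBy = new HashMap<>();

    void addDependency(String user, String used) {
        uses.computeIfAbsent(user, k -> new HashSet<>()).add(used);
        usedBy.computeIfAbsent(used, k -> new HashSet<>()).add(user);
    }

    Set<String> typesUsedBy(String t)  { return uses.getOrDefault(t, Set.of()); }
    Set<String> typesThatUse(String t) { return usedBy.getOrDefault(t, Set.of()); }
}
```

Serving both directions from one structure forces one of the two lookups to scan or to abuse the key layout, which is presumably where the "hacked together" part comes from.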

The way that it works now is hacked together, but designed for performance.
So we have three choices here:

1) Stick to the status quo, and leave the "indexes" as they are
2) Remove the Metadata secondary indexes, which will eliminate the hack but cost some performance on Metadata operations
3) Implement the Metadata secondary indexes correctly as Asterix indexes. For this solution to work with our dataset designs, we will need the ability to index homogeneous lists. In addition, we will have backward compatibility issues unless we plan out the transition.

What are the thoughts?


Orthogonally, it seems that the consensus for storing the datatype's dataverse in the dataset Metadata is to just add it as an open field, at least for now. Is that correct?

Steven


On Mon, Dec 14, 2015 at 1:23 PM, Mike Carey <[email protected]> wrote:

Thoughts inlined:

On 12/14/15 11:12 AM, Steven Jacobs wrote:

Here are the conclusions that Ildar and I have drawn from looking at the
secondary indexes:

First of all, it seems that datasets are local to node groups, but dataverses can span node groups, which seems a little odd to me.

Node groups are an undocumented but to-be-exploited-someday feature that allows datasets to be stored on fewer than all of the nodes in a given cluster. As we face bigger clusters, we'll want to open up that possibility. We will hopefully use them internally without having to make users manage them manually, like parallel DB2 did/does. Dataverses are really just a namespace thing, not a storage thing at all, so they are orthogonal to (and unrelated to) node groups.

There are three Metadata secondary indexes: GROUPNAME_ON_DATASET_INDEX,
DATATYPENAME_ON_DATASET_INDEX, DATATYPENAME_ON_DATATYPE_INDEX

The first is used in only one case: when dropping a node group, check if there are any datasets using this node group. If so, don't allow the drop.
BUT, this index has a field called "dataverse" which is not used at all.

This one seems like a waste of space since we do this almost never. (Not much space, but unnecessary.) If we keep it, it should become a proper index.
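The drop-guard pattern described above is simple enough to sketch; this is a toy stand-in (hypothetical names, not the actual Metadata API) for looking up datasets by node group before allowing a drop:

```java
import java.util.*;

// Toy sketch of the node-group drop guard: refuse to drop a node group
// while any dataset still references it.
class NodeGroupGuard {
    // stand-in for a GROUPNAME_ON_DATASET-style lookup
    private final Map<String, List<String>> datasetsByGroup = new HashMap<>();

    void registerDataset(String groupName, String datasetName) {
        datasetsByGroup.computeIfAbsent(groupName, k -> new ArrayList<>())
                       .add(datasetName);
    }

    void dropNodeGroup(String groupName) {
        List<String> users = datasetsByGroup.getOrDefault(groupName, List.of());
        if (!users.isEmpty()) {
            throw new IllegalStateException(
                "Cannot drop node group " + groupName + "; used by " + users);
        }
        datasetsByGroup.remove(groupName);
    }
}
```

Note that the check never consults a dataverse at all, which matches the observation that the index's "dataverse" field is dead weight.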

The second is used when dropping a datatype: if there is a dataset using this datatype, don't allow the drop.
Similarly, this index has a "dataverse" field which is never used.

You're about to use the dataverse part, right? :-) This index seems like it will be useful, but it should be a proper index.

The third index is used in two cases, using two different ideas of "keys".
It seems like this should actually be two different indexes.

I don't think I understood this comment....


This is my understanding so far. It would be good to discuss what the
"correct" version should be.
Steven




On Mon, Dec 14, 2015 at 10:12 AM, Steven Jacobs <[email protected]> wrote:

Hi all,
I'm implementing a change so that datasets can use datatypes from alternate dataverses (previously the type and set had to be from the same dataverse). Unfortunately this means another change for Dataset Metadata (which will now store the dataverse for its type).

As such, I had a couple of questions:

1) Should this change be thrown into the release branch, as it is another
Metadata change?

2) In implementing this change, I've been looking at the Metadata secondary indexes. I had a discussion with Ildar, and it seems the thread on Metadata secondary indexes being "hacked" has been lost. Is this also something that should get into the release? Is there anyone currently
looking at it?

Steven




Best regards,
Ildar


