Hi Steve,
Sorry for the delay.
I like your ideas, and I think this is the right direction. varint8 and
varint16 could be a good way to save space, but we should consider whether
this slows down some use cases, like partial field loading.
About the POC you created, I think it would be much more useful if you played
with real documents. It's easy, and you could push it to a separate branch
so that we and other developers can contribute and test. WDYT?
Follow these steps:
(1) create your serializer
This is the skeleton of the class to implement:
public class BinaryDocumentSerializer implements ORecordSerializer {
  public static final String NAME = "binarydoc";

  // UN-MARSHALLING
  public ORecordInternal<?> fromStream(final byte[] iSource) {
    return null; // TODO: build a record from the binary content
  }

  // PARTIAL UN-MARSHALLING
  public ORecordInternal<?> fromStream(final byte[] iSource, final ORecordInternal<?> iRecord, String[] iFields) {
    return null; // TODO: load only the fields listed in iFields
  }

  // MARSHALLING
  public byte[] toStream(final ORecordInternal<?> iSource, boolean iOnlyDelta) {
    return null; // TODO: write the record as binary content
  }
}
(2) register your implementation
ORecordSerializerFactory.instance().register(BinaryDocumentSerializer.NAME,
new BinaryDocumentSerializer());
(3) create a new ODocument subclass
Then create a new class that extends ODocument but uses your implementation:
public class BinaryDocument extends ODocument {
  protected void setup() {
    super.setup();
    _recordFormat = ORecordSerializerFactory.instance().getFormat(BinaryDocumentSerializer.NAME);
  }
}
(4) Try it!
And now try to create a BinaryDocument, set fields and call .save(). The
method BinaryDocumentSerializer.toStream() will be called.
Lvc@
On 18 February 2014 06:08, Steve <[email protected]> wrote:
>
> The point is: why should I store the field name when I've declared that
> a class has such names?
>
>
> Precisely. But I don't think you need to limit it to the declarative
> case... i.e. schema-full. By using a numbered field_id you cover
> schema-full, schema-mixed and schema-free cases with a single solution.
> There are two issues here... Performance and storage space. Arguably
> improving storage space also improves performance in a bigdata context
> because it allows caches to retain more logical units in memory.
>
>
> I've been having a good think about this and I think I've come up with a
> viable plan that solves a few problems. It requires schema versioning.
>
> I was hesitant to make this suggestion as it introduces more complexity in
> order to improve compactness and avoid unnecessary reading of metadata.
> However, I see from your original proposal that the problem exists there as
> well:
>
> *Cons:*
>
> - *Every time the schema changes, a full scan and update of record is
> needed*
>
> The proposal is that record metadata is made of 3 parts plus a meta-header
> (which in most cases would be 2-3 bytes): fixed-length schema-declared
> fields, variable-length schema-declared fields and schema-less fields. The
> problem, as you point out, with a single schema per class is that if you
> change the schema you have to update every record. If you insert a field
> before the last field you would likely have to rewrite every record from
> scratch.
>
> First a couple of definitions:
>
> varint8: a standard varint that is built from any number of 1 byte
> segments. The first bit of each segment is set to 1 if there is a
> subsequent segment. A number is constructed by concatenating the last 7
> bits of each byte. This allows for the following value ranges:
> 1 byte : 127
> 2 bytes: 16k
> 3 bytes: 2m
> 4 bytes: 268m
>
> varint16: same as varint8 but the first segment is 16 bits and all
> subsequent are 8 bits
> 2 bytes: 32k
> 3 bytes: 4m
> 4 bytes: 536m
>
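
A minimal sketch of the varint8 described above, to make the byte layout concrete. The class name VarInt8Sketch and the choice to write the most significant 7-bit group first are assumptions of this sketch; the description above does not fix the group order. A varint16 would differ only in reading 15 value bits from the first two bytes.

```java
// Hypothetical helper, not part of OrientDB: varint8 as described in this thread.
final class VarInt8Sketch {

    // Encodes a non-negative value, most significant 7-bit group first (assumed order).
    // The high bit of a byte is 1 when another byte follows.
    static byte[] encode(long value) {
        if (value < 0) throw new IllegalArgumentException("negative value");
        // Count the 7-bit groups needed (at least one, so 0 encodes as a single 0x00).
        int groups = 1;
        for (long v = value >>> 7; v != 0; v >>>= 7) groups++;
        byte[] out = new byte[groups];
        for (int i = groups - 1; i >= 0; i--) {
            int bits = (int) (value & 0x7F);
            // Every byte except the last one carries the continuation bit.
            out[i] = (byte) (i == groups - 1 ? bits : bits | 0x80);
            value >>>= 7;
        }
        return out;
    }

    // Decodes a varint8 starting at offset by concatenating the low 7 bits of each byte.
    static long decode(byte[] buf, int offset) {
        long result = 0;
        int pos = offset;
        while (true) {
            int b = buf[pos++] & 0xFF;
            result = (result << 7) | (b & 0x7F);
            if ((b & 0x80) == 0) return result;
        }
    }
}
```

With this layout, values up to 127 take one byte and values up to 16383 take two, which matches the ranges listed above.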
> nameId: an int (or long) index from a field name array. This index could
> be one per JVM or one per class. Getting the field name using the nameId
> is a single array lookup. This is stored on disk as a varint16 allowing
> 32k names before we need to use a 3rd byte for name storage.
>
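
The nameId idea could be sketched roughly as follows. FieldNameTableSketch and its method names are hypothetical, and persistence of the table is left out; the point is only that resolving a nameId is a single list lookup, while assigning ids needs a reverse map.

```java
// Hypothetical in-memory field name table, one per JVM or one per class.
final class FieldNameTableSketch {
    // nameId -> field name (the array lookup mentioned above)
    private final java.util.List<String> names = new java.util.ArrayList<String>();
    // field name -> nameId, used only when writing
    private final java.util.Map<String, Integer> ids = new java.util.HashMap<String, Integer>();

    // Returns the existing id for a name, or assigns the next free one.
    synchronized int idOf(String name) {
        Integer id = ids.get(name);
        if (id == null) {
            id = names.size();
            names.add(name);
            ids.put(name, id);
        }
        return id;
    }

    // Resolves a nameId read from a record header back to the field name.
    synchronized String nameOf(int nameId) {
        return names.get(nameId);
    }
}
```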
> I propose a record header that looks like this:
>
> version:varint8|header_length:varint8|variable_length_declared_field_headers|undeclared_field_headers
>
> Version is the schema version and would in most cases be only 1 byte. You
> would need 128 schema changes to make it 2 bytes. This proposal would
> require a cleanup tool that could scan all records and reset them all to
> the most recent schema version (at which point version is reset to 0). But
> it would not be necessary on every schema change. The user could choose if
> and when to run it. The only time you would need to do a full scan would be
> if you are introducing some sort of constraint and needed to validate that
> existing records don't violate the constraint.
>
> When a new schema is generated the user defined order of fields is stored
> in each field's Schema entry. Internally the fields are rearranged so that
> all fixed length fields come first. Because the order and length of fields
> is known by the schema there is no need to store offset/length in the
> record header.
>
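
To illustrate why fixed-length fields need no offset/length in the record header, here is a small sketch. FixedLayoutSketch is a hypothetical name, and the sketch assumes the schema can supply the byte size of each fixed-length field in its internal order.

```java
// Hypothetical helper: offsets of fixed-length fields derived from the schema alone.
final class FixedLayoutSketch {

    // fixedSizes[i] is the byte size of the i-th fixed-length field in internal order.
    // The result holds each field's offset from the start of the fixed-length section.
    static int[] offsets(int[] fixedSizes) {
        int[] offsets = new int[fixedSizes.length];
        int pos = 0;
        for (int i = 0; i < fixedSizes.length; i++) {
            offsets[i] = pos;      // field i starts where the previous fields end
            pos += fixedSizes[i];
        }
        return offsets;
    }
}
```

The offsets depend only on the schema version, so they could be computed once per version and reused for every record written with it.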
> Variable length declared fields need only a length and offset and the rest
> of the field meta data is determined by the schema.
>
> Finally undeclared (schema-less) fields require additional header data:
> nameId:varint16|dataType:byte?|offset:varint8|length:varint8
>
> I've attached a very rough partial implementation to try and demonstrate
> the concept. It won't run because a number of low level functions aren't
> implemented but if you start at the Record class you should be able to
> follow the code through from the read(int nameId) method. It demonstrates
> how you would read a schema/fixed, schema/variable and non-schema field
> from the record using random access.
>
> I think I've made one significant mistake in the demo code. I've used
> varints to store offset/length for schema-variable-length fields. This
> means you cannot find the header for one of those fields without scanning
> that entire section of the header. The same is true for schema-less fields.
> However, in that case it doesn't matter: since we don't know from the schema
> which fields are there (or their order), we have no option but to scan that
> part of the header to find the field metadata we are looking for.
>
> The advantage though of storing length as a varint is that perhaps in a
> majority of cases field length is going to be less than 127 bytes which
> means you can store it in a single byte rather than 4 or 8 for an int or
> long.
>
> We have a couple of potential tradeoffs to consider here (only relevant to
> the schema-declared variable-length fields). By doing a full scan of the
> header we can use varints with impunity and gain storage benefits from
> it. We can also dispense with storing the offset field altogether, as it
> can be calculated during the header scan. That potentially reduces the
> header entry for each field from 8 bytes (if you use int) to as little as
> 1. It also removes a potential constraint on maximum field length. On the
> other hand, if we use fixed-length fields (like int or long) to store
> offset/length we gain random access in the header.
>
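
The scan-based side of this tradeoff might look like the sketch below. It assumes the header section stores only one varint8 length per schema-declared variable-length field, in schema order, with the most significant 7-bit group first; LengthScanSketch and locate are hypothetical names. A target index beyond the header simply fails with an ArrayIndexOutOfBoundsException here.

```java
// Hypothetical scan over a lengths-only header section.
final class LengthScanSketch {

    // Returns {offset, length} of field number `target`, where offset is relative
    // to the start of the variable-length data section. Offsets are never stored;
    // they are accumulated from the lengths of the preceding fields.
    static int[] locate(byte[] header, int target) {
        int pos = 0;     // read position inside the header section
        int offset = 0;  // running sum of the lengths seen so far
        for (int field = 0; ; field++) {
            // Inline varint8 decode: 7 value bits per byte, high bit = "more follows".
            int length = 0;
            int b;
            do {
                b = header[pos++] & 0xFF;
                length = (length << 7) | (b & 0x7F);
            } while ((b & 0x80) != 0);
            if (field == target) return new int[] { offset, length };
            offset += length;
        }
    }
}
```

Reading field n costs a scan over n + 1 header entries, which is the price paid for one-byte entries; fixed-width int entries would allow jumping straight to entry n.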
> I can see two edge cases where this sort of scheme would run into
> difficulties or potentially create a storage penalty. 1) a dataset that
> has a vast number of different fields. Perhaps where the user is for some
> reason using the field name as a kind of meta-data which would increase the
> in-memory field_name table, and 2) where a user has adopted the (rather
> hideous) MongoDB solution of abbreviating field names and taken it to the
> extreme of a single character field name. In this case my proposed 16 bit
> minimum nameIndex size would be 8 bits over what could be achieved.
>
> The first issue could be dealt with by making the tokenised field name
> feature available only in the case where the field is declared in the
> schema (basically your proposal). But that would also require a flag on the
> internally stored field_name token to indicate if it's a schema token or a
> schema-less full field name. It could be mitigated by giving an option for
> full field_name storage (I would imagine this would be a rare use case).
>
> The second issue (if deemed important enough to address) could also be
> dealt with by a separate implementation of something like IFieldNameDecoder
> that uses an 8 bit segment, and by asking the user to declare a cluster/class
> as using that if they have a use case for it.
>