On Thu, Dec 06, 2018 at 03:54:46PM -0800, Jonathan Tan wrote:
> This makes sense - I thought I shouldn't mention the commit graph in the
> code since it seems like a layering violation, but I felt the need to
> mention the commit graph in a comment, so maybe the need to mention it
> in the code is there too. Following that, maybe the lookup-for-type
> could be replaced by a lookup-in-commit-graph (maybe by using
> parse_commit_in_graph() directly), which should be at least slightly
> faster.
That makes more sense to me. If we don't have a commit graph at all,
it's a quick noop. If we do, we might binary search in the list of
commits for a non-commit. But that's strictly faster than finding the
object's type (which involves a binary search of a larger list, followed
by actually accessing the type info).
> > In general, it would be nice if we had a more incremental API
> > for accessing objects: open, get metadata, then read the data. That
> > would make these kinds of optimizations "free".
>
> Would this assume that (1) to read the data, you would first need to
> read the metadata, and (2) there would be no redundancy between the two
> reads? It seems to me that for loose objects you would want to perform
> all your reads at once, since any read requires opening the file,
> whereas for commit graphs you would just read what you want, since the
> metadata and the data live in separate places.
By metadata here, I don't mean the commit-graph data, but just the
object type and size. So I'm imagining an interface more like:
- object_open() locates the object, and stores either the pack
file/offset or a descriptor to a loose path in an opaque handle
struct
- object_size() and object_type() on that handle would do what you
expect. For loose objects, these would parse the header (the
equivalent of unpack_sha1_header()). For packed ones, they'd use the
object header in the pack (and chase down the delta bits as needed).
- object_contents() would return the full content
- object_read() could sequentially read a subset of the file (this
could replace the streaming interface we currently have)
We have most of the low-level bits for this already, if you poke into
what oid_object_info_extended() is doing. We just don't have them
packaged in an interface that can persist across multiple calls.
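To make that concrete, the handle and accessors might look something
like this (all of these names are made up; none of them exist in
git.git today):

  /* A rough sketch, not a tested interface. */
  struct object_handle {
	struct repository *repo;
	struct object_id oid;

	/* exactly one of these is filled in by object_open() */
	struct packed_git *pack;	/* packed: pack plus offset */
	off_t offset;
	int loose_fd;			/* loose: open file descriptor */
  };

  int object_open(struct object_handle *oh, const struct object_id *oid);
  enum object_type object_type(struct object_handle *oh);
  unsigned long object_size(struct object_handle *oh);
  void *object_contents(struct object_handle *oh);	/* whole buffer */
  ssize_t object_read(struct object_handle *oh, void *buf, size_t len);
  void object_close(struct object_handle *oh);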
With an interface like that, parse_object()'s large-blob check could be
something like the patch below.
But your case here is a bit more interesting. If we have a commit graph,
then we can avoid opening (or even finding!) the on-disk object at all.
So I actually think it makes sense to just check the commit-graph first,
as discussed above.
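I.e., something like this at the top of parse_object() (totally
untested, and it glosses over the replace-object lookup, but
lookup_commit() and parse_commit_in_graph() both already exist):

  /*
   * Hypothetical commit-graph fast path; just a sketch.
   * parse_commit_in_graph() fills in the commit from the graph
   * file without touching the object database at all.
   */
  struct commit *commit = lookup_commit(r, oid);
  if (commit && parse_commit_in_graph(r, commit))
	return &commit->object;
  /* otherwise fall through to the regular object lookup */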
---
diff --git a/object.c b/object.c
index e54160550c..afce58c0bc 100644
--- a/object.c
+++ b/object.c
@@ -254,23 +254,31 @@ struct object *parse_object(struct repository *r, const struct object_id *oid)
const struct object_id *repl = lookup_replace_object(r, oid);
void *buffer;
struct object *obj;
+ struct object_handle oh;
obj = lookup_object(r, oid->hash);
if (obj && obj->parsed)
return obj;
- if ((obj && obj->type == OBJ_BLOB && has_object_file(oid)) ||
- (!obj && has_object_file(oid) &&
- oid_object_info(r, oid, NULL) == OBJ_BLOB)) {
- if (check_object_signature(repl, NULL, 0, NULL) < 0) {
+ if (object_open(&oh, oid) < 0)
+ return NULL; /* missing object */
+
+ if (object_type(&oh) == OBJ_BLOB) {
+ /* this will call object_read() on 4k chunks */
+ if (check_object_signature_stream(&oh, oid)) {
error(_("sha1 mismatch %s"), oid_to_hex(oid));
return NULL;
}
+ object_close(&oh); /* we don't care about contents */
parse_blob_buffer(lookup_blob(r, oid), NULL, 0);
return lookup_object(r, oid->hash);
}
- buffer = read_object_file(oid, &type, &size);
+ type = object_type(&oh);
+ size = object_size(&oh);
+ buffer = object_contents(&oh);
+ object_close(&oh);
+
if (buffer) {
if (check_object_signature(repl, buffer, size, type_name(type))
< 0) {
free(buffer);