By the way, the ADTree I want to build is not just a mirror of the data on disk. The data on disk is a list of events (aka instances), one per line, with a bunch of feature-value pairs on each line describing an event/instance. An ADTree is a concise way of storing counts of how often subsets of feature-values co-occurred. A naive multi-dimensional array would be exponential in size with respect to the number of unique feature-value pairs. An ADTree is much, much more concise than a naive multi-d array, but will still usually be much larger than the raw data on disk. However, it is extremely inefficient to try to count feature-value co-occurrences directly from the raw data. Hence my interest in ADTrees. Since ADTrees are a kind of summary of the raw data, it doesn't make sense to read only part of the raw data file and build an ADTree for only that portion.

I have no doubt that the C++ code I'm using is as efficient as it could be. (I didn't write it; it was loaned to me by the inventor of ADTrees.) However, the current code is limited to features with a maximum arity of 50. Some of my features have arities on the order of 30,000-40,000 possible values. So I'm exploring how to adjust the structure of ADTrees to deal with high-arity attributes without losing their size/speed benefits. One possibility is to store only the 50 most frequent values at any given level of the tree, plus a 'misc.' catch-all value. This would be much more feasible than storing all 30,000 values, but it means that I might end up with many (thousands or tens of thousands of) unique record types (i.e., arity lists). That's why I'm interested in understanding how much space it takes to store a record's arity. Is it linear in the number of attributes?

My record arities are NOT dynamic in the sense that I will be doing Adjoin operations. However, I could end up with a structure that contains many different arities. And during construction I will be using variables to refer to features (i.e., Tree.Feature). Doesn't this mean the arities will need to be around at runtime, not just at compile time? Once the tree is built, I would want to pickle it and reuse it later to look up counts. I would not change it at all once it was fully constructed.

How much space do dictionaries take compared to records? Are they the same size? I thought I read in a past posting that dictionaries have linear-time access to fields, rather than constant-time access. That would be an important difference to me, because I care a lot about lookup efficiency. Is that true?

Thanks for all your help so far,
Irene


Date: Tue, 27 Sep 2005 10:02:13 +0200
From: Raphael Collet <[EMAIL PROTECTED]>
Subject: Re: memory space efficiency of records
To: [EMAIL PROTECTED]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed

Irene Langkilde Geary wrote:
> Thanks!  This is what I wanted to know.  It is especially helpful to
> understand that having records with different arities can have an impact.
>
> How much space does each arity structure take?

That's a (usually small) hash table, which maps each field name to an
index.  It's quite compact in practice.  Note that in the case of tuples,
the arity is just a number (the tuple's width).

The only problem with arities is that the current implementation does
not garbage-collect them.  This is why we recommend avoiding operations
that create new records, such as Adjoin.  If record arities are very
dynamic in your program, it is safer to use dictionaries, which are
plain hash tables (and are gc'ed).

Cheers,
raph
-------------------------------------------------------------
Date: Tue, 27 Sep 2005 10:02:24 +0200
From: Filip Konvička <[EMAIL PROTECTED]>
Subject: Re: memory efficiency of records
To: [EMAIL PROTECTED]
Message-ID: <[EMAIL PROTECTED]>
Content-Type: text/plain; charset=UTF-8; format=flowed

>> if your only concern is memory usage, then stick to C/C++.
>
> I disagree.  Data structures in Mozart are quite memory efficient.
> Lists use two
> words per list element.  Records take N+2 words.  Tuples take N+1 words.
> Dictionaries are especially efficient: they grow and shrink as
> necessary.  Data
> structures have good implementations in Mozart.   There is really no
> reason to
> reinvent the wheel in C++.

I think there are a couple of issues with Oz here: the arities, and the
limitation to the basic Oz types (yes, new ones can be implemented in
C++ ;-)), should some non-builtin data structure turn out to be a better fit.

On the other hand, the atom table may help reduce the space occupied by
strings if they occur repeatedly, and arities may help speed up lookup
(but both can be implemented easily in C++, e.g. using a hash_map).

But the point of my message was not to claim that the Oz data
representation is inefficient!! (Actually, I think it is quite
efficient.) I just wanted to point out that when you get a 1GB memory
footprint for 250MB of data, it is a sign of a poor C++ implementation,
and that some minor corrections to the existing code might bring the
desired results faster than rewriting the thing in Oz.

> The only reason to use C++ is for performance, and even there usually
> for only
> a small part of the program (i.e., write it all in Oz, profile it, and
> rewrite a small
> part in C++).  What's more, some of the more sophisticated parts of Oz,
> such
> as the lightweight threads and the constraint solver, are already
> written in C++!
>
> The lesson is don't give up too quickly on high-level languages.

Yes, but the question was strictly about minimal memory footprint. I
suggested investing some limited effort into "optimization" of the
existing C++ code rather than rewriting the whole thing in Oz in the
hope that a straightforward rewrite would consume less memory. I'm by
no means a fan of reinventing the wheel, but I'm no fan of rewriting
existing code without a reason, either.

Two more comments:
1. IMHO C++ has become quite high-level in recent years, if you stick to
the standard containers.
2. Oz is cool - I think we'll agree on this :-)

I think that although you can achieve better memory behavior in C++
than in Oz with little effort (not only consumption, but also
fragmentation etc.), Oz may be more suitable for some operations on the
data. But this is beyond the topic, isn't it?

Cheers,
Filip

_________________________________________________________________________________
mozart-users mailing list                               
[email protected]
http://www.mozart-oz.org/mailman/listinfo/mozart-users
