moin Rob,

On 2011-10-01 03:54, Rob Freeman wrote:
> Bryan,
> 
> I did a quick search for your papers. Seems you're mostly looking
> for "canonical" dimensions for texts. Perhaps you did some categories
> for individual words earlier(?)

Yup; the only publication that came out of it was in 2005:
http://www.ling.uni-potsdam.de/~moocow/pubs/jurish2005hybrid_color.pdf
... the whole sparse pdl thing happened after that; it was intended to
be part of my dissertation, but I wound up getting funding for something
completely different (the canonical-forms-for-historical-text stuff)...
defense is Thursday ;-)

> I'm characterizing words (in terms of contexts, or texts), so kind
> of the inverse of characterizing texts.

Characterizing words by co-occurrence distributions was what I was
after, too -- the Big Idea was to induce (something like) syntactic
categories from the observed distributions.

> In particular what I'm doing which is new is trying to find a 
> principled way to syntactically combine these word level 
> representations, as vectors, without abstracting them into categories
> first.

Sounds very cool.  I think I heard Katrin Erk talk about something
similar at ACL 2010, but I've never tried anything like it myself.

> In detail it means I'm trying to build context vectors for
> unobserved pairs of words, by substituting contexts for observed
> pairs between their respective elements.

Sounds a bit like a brand of smoothing, although I'm not really sure
what "substituting contexts for observed pairs between their respective
elements" means here -- unless maybe it's a sort of transitivity, e.g.
you have {ab,cd,be,de} and to compare "a" and "c" you compose the
adjacency relation to get {a(b)e,c(d)e}, yielding a shared attribute
("*e") for "a" and "c"... I guess that would explain your interest in
the cross-product too, but I'm just guessing...
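If it is that sort of transitivity, a toy pure-Perl sketch of the guess
(made-up pairs, no PDL) might look like:

```perl
use strict;
use warnings;

# Hypothetical data: observed pairs {ab, cd, be, de}; composing the
# adjacency relation gives "a" and "c" the shared attribute "*e".
my @pairs = ( [qw(a b)], [qw(c d)], [qw(b e)], [qw(d e)] );

# right-neighbour adjacency: word => { neighbour => 1, ... }
my %adj;
$adj{ $_->[0] }{ $_->[1] } = 1 for @pairs;

# second-order contexts: everything reachable in exactly two steps
sub second_order {
  my ($w) = @_;
  my %ctx;
  for my $mid ( keys %{ $adj{$w} || {} } ) {
    $ctx{$_} = 1 for keys %{ $adj{$mid} || {} };
  }
  return \%ctx;
}

my ($a2, $c2) = ( second_order('a'), second_order('c') );
my @shared = grep { $c2->{$_} } keys %$a2;
print "shared: @shared\n";   # a(b)e and c(d)e share "*e" -> prints "shared: e"
```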

> It relies linguistically on the assumption that words with common 
> contexts tend to behave the same, so they'll have pairs which have 
> the same contexts etc.

Yes, very Firthian (actually very Leibnizian, but who's counting?).

> Anyway, this means I'm always searching for pairs of words which 
> occur in sequence in a text, so I can perform my pair search as a 
> parallel search indexed by word position in the text. It's linear on
>  text size, which is still a big win over exponential on vector
> size.

... in this context, I've got a pdl-ified corpus representation floating
about somewhere if you're interested; basically it relies on a perl hash
to map word types to integer ids and stores the corpus as a flat vector
(with sentence boundaries).  It's pretty easy to get co-occurrence
statistics from that (see e.g. PDL::Ngrams), and they wrap nicely into
PDL::CCS::Nd.
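In pure Perl (sans PDL), that representation amounts to something like
the following sketch; the token stream and the __EOS__ boundary marker
are invented for illustration, and a real version would store @corpus
as a pdl and get the counts via PDL::Ngrams:

```perl
use strict;
use warnings;

# perl hash mapping word types to integer ids
my (%id, @word);
sub word2id {
  my ($w) = @_;
  $id{$w} //= do { push @word, $w; $#word };
}

my $BOS = word2id('__EOS__');   # sentence-boundary pseudo-token
my @corpus;                     # corpus as a flat id vector
for my $sent ('the cat is on the mat', 'the cat sat') {
  push @corpus, map { word2id($_) } split ' ', $sent;
  push @corpus, $BOS;
}

# bigram co-occurrence counts straight off the flat vector
my %bigram;
$bigram{"$corpus[$_],$corpus[$_+1]"}++ for 0 .. $#corpus - 1;

printf "types=%d tokens=%d\n", scalar(@word), scalar(@corpus);
print "count(the cat) = ",
      $bigram{ word2id('the') . ',' . word2id('cat') }, "\n";   # 2
```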

> It sounds very similar to your linear parallel search on operands' 
> index values. Your operands' index values must have similar meaning.
>  There's an assumption that one sorting order is appropriate for all
>  searches.

"Operands" in the sense of my last mail just meant "PDL::CCS::Nd sparse
n-d piddles passed as arguments to some arithmetic operation".  In my
application, the "index values" of these pdls corresponded to word
types: e.g. with {the->0,cat->1,is->2,on->3,mat->4}, the sentence "the
cat is on the mat" is the vector [0 1 2 3 0 4], and the sort order is
just lexicographic (so "the cat" < "the mat"). That means you've got to
specify either full index-vectors to search ("the cat" vs. "* cat") or
some prefix of the dimensions ("the *").  There is some support for
re-sorting on a different dimension (via reorder() etc) in order to
search for e.g. "* cat" -- I usually just copied the sparse pdl's
indices, slice()d them to reorder (make "* cat" searchable as "cat *"),
and re-sorted (qsortvec()) to do this sort of thing.  The module tries
to do that implicitly if and when required, but if you need to do it
often, you're better off caching the inverted matrices.
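A pure-Perl toy of the same prefix-search-plus-reorder trick (made-up
pairs using the ids from the example above; the real thing does all of
this on pdls):

```perl
use strict;
use warnings;

# observed pairs, lexicographically sorted: ids the->0, cat->1, is->2,
# on->3, mat->4 (invented for illustration)
my @pairs = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
            ( [0,1], [1,2], [2,3], [0,4], [3,0] );

# binary search: first index whose pair has 0th element >= $w
sub lower_bound {
  my ($pairs, $w) = @_;
  my ($lo, $hi) = (0, scalar @$pairs);
  while ($lo < $hi) {
    my $mid = int( ($lo + $hi) / 2 );
    $pairs->[$mid][0] < $w ? ($lo = $mid + 1) : ($hi = $mid);
  }
  return $lo;
}

# "the *": prefix search on the 0th dimension
my $i = lower_bound(\@pairs, 0);
my @the_star;
push @the_star, $pairs->[$i++] while $i < @pairs && $pairs->[$i][0] == 0;
print "the *: ", scalar(@the_star), " pairs\n";   # "the cat", "the mat"

# "* cat": swap the dimensions, re-sort, and the same prefix search works
my @inv = sort { $a->[0] <=> $b->[0] || $a->[1] <=> $b->[1] }
          map { [ $_->[1], $_->[0] ] } @pairs;
my $j = lower_bound(\@inv, 1);   # cat == 1
print "* cat: first hit = [@{$inv[$j]}]\n";
```

Caching @inv is the pure-Perl analogue of caching the inverted matrices.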

You could use the same strategy to make an occurrence-list in a flat
corpus id-vector look like a single huge sparse pdl, with as many
dimensions as your most frequent word type has tokens in the corpus; if
that's what you're after -- only the 0th dimension (word type) would be
"interesting", and you get a list of tokens (as indices into the flat
corpus vector) for any word type in O(log(Types)), which is
substantially faster than a linear search over all tokens in the corpus.
You can actually get this particular operation down to O(1) if you use
the ptr() method to get a Harwell-Boeing pointer for the dimension of
interest.
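For the flat id-vector case, that ptr()-style pointer boils down to a
counting sort plus an offset array; a pure-Perl sketch (corpus ids
invented for illustration):

```perl
use strict;
use warnings;

my @corpus = (1, 2, 3, 4, 1, 5, 0, 1, 2, 6);   # flat id vector
my $ntypes = 7;

# token positions, sorted by the type occurring there
my @pos = sort { $corpus[$a] <=> $corpus[$b] || $a <=> $b } 0 .. $#corpus;

# Harwell-Boeing-style pointer: @pos[ $ptr[$t] .. $ptr[$t+1]-1 ] are the
# occurrences of type $t
my @ptr = (0) x ($ntypes + 1);
$ptr[ $_ + 1 ]++ for @corpus;                 # count tokens per type
$ptr[$_] += $ptr[ $_ - 1 ] for 1 .. $ntypes;  # running offsets

sub occurrences {   # O(1): just a slice into @pos
  my ($t) = @_;
  return @pos[ $ptr[$t] .. $ptr[$t + 1] - 1 ];
}

print "type 1 occurs at: @{[ occurrences(1) ]}\n";   # positions 0 4 7
```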

> I'm not sure if I did better on size than O(Nnz**2). I'm pretty sure
> that working with arrays was just too big for me; my limited
> programming knowledge didn't extend to a way to use them sparsely.

I think maybe you misunderstood.  The PDL::CCS::Nd arrays are basically
just the ($indices,$values) pairs I mentioned, plus some flags and
administrative data.  Both $indices and $values are themselves pdls; so e.g.

 use PDL;
 use PDL::CCS::Nd ':vars';
 my $sp = toccs( pdl( [[0,0,1],[2,0,0]] ) );
 print "indices=", $sp->[$WHICH], "values=", $sp->[$VALS];

prints:

indices=
[
 [0 1]
 [2 0]
]
values=[2 1 0]

> I had to go to Perl hashes to get something sparse. I thought you'd 
> done the same. Am I now understanding rightly that your CCS's are 
> arrays.

Yes, they're arrays of pdls. These arrays are bless()ed into a perl
class which tries to behave like a pdl, but really isn't.

> If your CCS's are a sparse implementation with arrays that might be a
> win for me. Would they be a lot smaller than Perl hashes, do you
> think?

Definitely, since you eliminate the overhead needed for gazillions of
perl scalars.  And faster, too; at least in every case I've tested.

marmosets,
        Bryan

> -Rob
> 
> On Fri, Sep 30, 2011 at 4:50 PM, Bryan Jurish 
> <[email protected]> wrote:
>> moin Rob,
>> 
>> On 2011-09-29 14:16, Rob Freeman wrote:
>>> Hi Bryan,
>>> 
>>> I haven't seen this appear on the list yet. Are you sure you sent
>>> from the address you are subscribed with? I've been caught by
>>> that one before elsewhere.
>> 
>> I had begun to suspect something similar myself... now testing with
>> the proper address; thanks for the heads-up!
>> 
>>> Thanks for the feedback. And David, Chris, and Mark too.
>>> 
>>> Bryan, nice to find someone working on a similar problem. I'll 
>>> check out your work. At first glance your ($indices,$values) pairs
>>> might be similar to how I have addressed the problem up to now.
>>> Not within PDL, but as a regular Perl hash.
>>> 
>>> Your logarithmic search time sounds interesting. I get linear 
>>> time searching for word pairs. Frankly I was glad to get it down
>>>  from exponential time in the size of the vectors.
>> 
>> Amen.
>> 
>>> It may not be the same search we are talking about. I'm searching
>>> for pairs between every element of a vector, and every element of
>>> another.
>> 
>> Ah, that's a bit trickier.  I've written a helper routine 
>> PDL::CCS::Ops::ccs_binop_align_mia() to do basically just this, 
>> which I use to implement the standard binary arithmetic operations.
>> The major gotcha is that "missing" (read 'zero') values in either
>> operand do not get aligned at all (which is why matrix 
>> multiplication rsp. dot-product are such a PITA); hence the "_mia"
>>  suffix (= 'missing-is-annihilator'). It's worst-case quadratic 
>> time, but for very sparse pdls (including my co-occurrence 
>> matrices), it's much closer to linear (sorted parallel search of 
>> operands' index vectors).  It requires a bit of twiddling on the 
>> perl end to ensure that all vectors get aligned (block-wise 
>> iteration) because PDL needs to know the size of the output pdls 
>> before they've been computed, which is generally A Bad Thing for 
>> sparse matrix operations.
>> 
>>> However I do it though my memory usage just explodes when I try
>>> a cross product. Too many intermediate products being summed, 
>>> nesting two or three levels of substitutions over 10k+ 
>>> dimensions. I need to keep all intermediate products in RAM at 
>>> once or it is way too slow. But with it all in RAM the memory 
>>> usage blows out.
>> 
>> Yup.  Cross-product is nasty even with sparse pdls.  Occasionally
>>  I had to drill open the CCS::Nd objects (really just perl arrays)
>>  and work directly on the underlying ($indices,$values) pairs -- 
>> cross-product is pretty simple to define by direct manipulation of
>>  $indices, although memory usage is still O(Nnz**2), where Nnz is 
>> the number of non-zero values.
>> 
>> marmosets, Bryan
> 

-- 
Bryan Jurish                       "There is *always* one more bug."
[email protected]       -Lubarsky's Law of Cybernetic Entomology

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl
