Bryan,

I did a quick search for your papers. It seems you're mostly looking
for "canonical" dimensions for texts. Perhaps you did some work on
categories for individual words earlier?

I'm characterizing words (in terms of their contexts, or texts), so
kind of the inverse of characterizing texts. In particular, what's new
in what I'm doing is trying to find a principled way to combine these
word-level representations syntactically, as vectors, without
abstracting them into categories first.

In detail, this means I'm trying to build context vectors for
unobserved pairs of words by substituting contexts from observed pairs
between their respective elements.

Linguistically, it relies on the assumption that words with common
contexts tend to behave the same, so they'll form pairs which have the
same contexts, and so on.
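To make the substitution idea concrete, here's a minimal sketch (in Python, with invented toy data; the names and the overlap weighting are my own illustration, not an actual implementation):

```python
from collections import Counter

# Toy data: each word maps to a bag of observed contexts.
contexts = {
    "strong": Counter({"_ tea": 3, "_ wind": 2}),
    "powerful": Counter({"_ wind": 2, "_ engine": 1}),
}
# Contexts observed for word *pairs*, keyed by (w1, w2).
pair_contexts = {
    ("powerful", "engine"): Counter({"a _ roared": 1}),
}

def estimate_pair(w1, w2):
    """Estimate a context vector for an unobserved pair (w1, w2) by
    substituting a word u that shares contexts with w1 and whose pair
    (u, w2) *was* observed; borrowed contexts are weighted by the
    context overlap between w1 and u."""
    if (w1, w2) in pair_contexts:
        return pair_contexts[(w1, w2)]
    estimate = Counter()
    for (u, v), ctx in pair_contexts.items():
        if v == w2 and u != w1:
            # Counter '&' keeps shared keys with their minimum counts.
            overlap = sum((contexts[w1] & contexts[u]).values())
            if overlap:
                for c, n in ctx.items():
                    estimate[c] += n * overlap
    return estimate

# ("strong", "engine") was never observed, but "strong" shares the
# context "_ wind" with "powerful", so it inherits that pair's contexts.
print(estimate_pair("strong", "engine"))
```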

Anyway, this means I'm always searching for pairs of words which occur
in sequence in a text, so I can perform my pair search as a parallel
search indexed by word position in the text. It's linear in text size,
which is still a big win over exponential in vector size. That sounds
very similar to your linear parallel search on operands' index values.
Your operands' index values must have a similar meaning; there's an
assumption that one sorting order is appropriate for all searches.
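For comparison, the sorted parallel search over index vectors can be sketched as a generic two-pointer merge (a Python illustration of the technique, not the actual ccs_binop_align_mia code):

```python
def align_nonzeros(ix_a, ix_b):
    """Two-pointer walk over two *sorted* index vectors, returning the
    (position_a, position_b) pairs where both operands are non-missing.
    Linear in len(ix_a) + len(ix_b), rather than quadratic."""
    out, i, j = [], 0, 0
    while i < len(ix_a) and j < len(ix_b):
        if ix_a[i] == ix_b[j]:
            out.append((i, j))  # both non-zero here: aligned
            i += 1
            j += 1
        elif ix_a[i] < ix_b[j]:
            i += 1              # missing in b: annihilated, skip
        else:
            j += 1              # missing in a: annihilated, skip
    return out

# Indices of the non-zero entries of two sparse vectors;
# they align only at index values 3 and 7.
print(align_nonzeros([0, 3, 7, 9], [3, 4, 7, 8]))
```

The single-pass merge is what makes one shared sorting order essential: both index vectors must be sorted the same way or the pointers can't advance monotonically.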

I'm not sure I did better on size than O(Nnz**2). I'm pretty sure that
working with dense arrays was just too big for me; my limited
programming knowledge didn't stretch to a way to use them sparsely. I
had to go to Perl hashes to get something sparse. I thought you'd done
the same. Am I now understanding rightly that your CCS's are arrays?
If your CCS's are a sparse implementation with arrays, that might be a
win for me. Would they be a lot smaller than Perl hashes, do you
think?
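To make the question concrete, here is roughly the contrast I have in mind (Python stand-ins for a Perl hash versus a ($indices,$values) pair of packed arrays; the representations only, not the actual CCS internals):

```python
from array import array
from bisect import bisect_left

# Hash-style sparse vector: one dict entry per non-zero value.
sparse_hash = {2: 1.5, 17: 0.25, 104: 3.0}

# ($indices,$values)-style: two packed parallel arrays, indices sorted.
indices = array("q", [2, 17, 104])    # 8 bytes per index
values = array("d", [1.5, 0.25, 3.0])  # 8 bytes per value

def lookup(idx, ind, val):
    """Binary search over the sorted index array: O(log Nnz) per
    lookup, versus O(1) for the hash, but with far less per-entry
    storage overhead."""
    k = bisect_left(ind, idx)
    if k < len(ind) and ind[k] == idx:
        return val[k]
    return 0.0  # missing means zero

print(lookup(17, indices, values))  # stored value
print(lookup(50, indices, values))  # missing, so 0.0
```

The packed pair costs on the order of 16 bytes per non-zero entry, while a hash entry typically carries tens of bytes of bookkeeping on top of the key and value, which is why the array representation can be a substantial size win at the same sparsity.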

-Rob

On Fri, Sep 30, 2011 at 4:50 PM, Bryan Jurish
<[email protected]> wrote:
> moin Rob,
>
> On 2011-09-29 14:16, Rob Freeman wrote:
>> Hi Bryan,
>>
>> I haven't seen this appear on the list yet. Are you sure you sent from
>> the address you are subscribed with? I've been caught by that one
>> before elsewhere.
>
> I had begun to suspect something similar myself... now testing with the
> proper address; thanks for the heads-up!
>
>> Thanks for the feedback. And David, Chris, and Mark too.
>>
>> Bryan, nice to find someone working on a similar problem. I'll check
>> out your work. At first glance your ($indices,$values) pairs might be
>> similar to how I have addressed the problem up to now. Not within PDL,
>> but as a regular Perl hash.
>>
>> Your logarithmic search time sounds interesting. I get linear time
>> searching for word pairs. Frankly I was glad to get it down from
>> exponential time in the size of the vectors.
>
> Amen.
>
>> It may not be the same
>> search we are talking about. I'm searching for pairs between every
>> element of a vector, and every element of another.
>
> Ah, that's a bit trickier.  I've written a helper routine
> PDL::CCS::Ops::ccs_binop_align_mia() to do basically just this, which I
> use to implement the standard binary arithmetic operations.  The major
> gotcha is that "missing" (read 'zero') values in either operand do not
> get aligned at all (which is why matrix multiplication rsp. dot-product
> are such a PITA); hence the "_mia" suffix (= 'missing-is-annihilator').
>  It's worst-case quadratic time, but for very sparse pdls (including my
> co-occurrence matrices), it's much closer to linear (sorted parallel
> search of operands' index vectors).  It requires a bit of twiddling on
> the perl end to ensure that all vectors get aligned (block-wise
> iteration) because PDL needs to know the size of the output pdls before
> they've been computed, which is generally A Bad Thing for sparse matrix
> operations.
>
>> However I do it though my memory usage just explodes when I try a
>> cross product. Too many intermediate products being summed, nesting
>> two or three levels of substitutions over 10k+ dimensions. I need to
>> keep all intermediate products in RAM at once or it is way too slow.
>> But with it all in RAM the memory usage blows out.
>
> Yup.  Cross-product is nasty even with sparse pdls.  Occasionally I had
> to drill open the CCS::Nd objects (really just perl arrays) and work
> directly on the underlying ($indices,$values) pairs -- cross-product is
> pretty simple to define by direct manipulation of $indices, although
> memory usage is still O(Nnz**2), where Nnz is the number of non-zero values.
>
> marmosets,
>        Bryan

_______________________________________________
Perldl mailing list
[email protected]
http://mailman.jach.hawaii.edu/mailman/listinfo/perldl