Hey Ronak,

In my (admittedly limited) experience, topic modeling data is sparse,
especially if you're using a bag-of-words approach. If you are using
bag-of-words, you can probably do quite well with Text::SpeedyFx
<https://metacpan.org/pod/Text::SpeedyFx>. It imposes a number of
limitations on the analytical approach, but it will consume a tiny fraction
of the memory you would otherwise need, and it will run much faster than a
full numeric approach. It's easy to compute the cosine similarity between
two documents using it <https://coderwall.com/p/284hja>, and the
distribution even comes with a cosine similarity script (see the MetaCPAN
index page <https://metacpan.org/release/Text-SpeedyFx> for the
distribution).
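To make the idea concrete, here is a minimal pure-Perl sketch of cosine
similarity on sparse bag-of-words counts. It does not use Text::SpeedyFx
(which hashes words rather than storing them); the helper names
bag_of_words and cosine_similarity are made up for this example, and the
tokenizer is a naive \w+ match, so treat it as an illustration only.

```perl
use strict;
use warnings;

# Build a sparse bag-of-words: only words that actually occur get an entry.
# (Hypothetical helper; the \w+ tokenizer is a simplifying assumption.)
sub bag_of_words {
    my ($text) = @_;
    my %count;
    $count{ lc $1 }++ while $text =~ /(\w+)/g;
    return \%count;
}

# Cosine similarity between two sparse count vectors given as hash refs.
# (Hypothetical helper; returns 0 if either document is empty.)
sub cosine_similarity {
    my ($x, $y) = @_;
    my ($dot, $norm_x, $norm_y) = (0, 0, 0);
    $norm_x += $_ ** 2 for values %$x;
    $norm_y += $_ ** 2 for values %$y;
    # Only words present in both documents contribute to the dot product.
    for my $word (keys %$x) {
        $dot += $x->{$word} * $y->{$word} if exists $y->{$word};
    }
    return 0 unless $norm_x && $norm_y;
    return $dot / sqrt($norm_x * $norm_y);
}

my $doc1 = bag_of_words('the cat sat on the mat');
my $doc2 = bag_of_words('the cat lay on the rug');
printf "%.3f\n", cosine_similarity($doc1, $doc2);   # prints 0.750
```

Because each document is stored as a hash of only the words it contains,
memory use scales with the number of distinct words per document rather
than with the size of the whole vocabulary.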
If you need *exact* topic modeling using a bag-of-words approach, then you
should probably work with short or long integer data types. (I'm assuming
you're tracking word counts, so you don't need a floating-point
representation.) This will likely reduce your memory costs. If you go this
route, you'll likely want to use PDL::Slatec for your matrix operations,
which means you'll have to make sure it's installed.

If you are not using a bag-of-words approach, then Bryan Jurish's modules
<https://metacpan.org/author/MOOCOW> may come in handy, especially PDL::CCS
<https://metacpan.org/pod/PDL::CCS>.

David

On Fri, Sep 5, 2014 at 11:49 AM, Craig DeForest <[email protected]> wrote:

> If your matrix is not necessarily sparse, you will have to process it all
> through memory. PDL is optimized for problems that fit in your machine's
> RAM limit. 15000x15000 floats is 900 MB, which should fit within most
> machines. (15000x15000 double-precision values is 1.8 GB, which should
> also be OK.) You'll need to set the global variable $PDL::BIGPDL to 1 to
> let Perl know you plan to work with arrays that large.
>
> My laptop computer has 16GB of RAM. This works fine:
>
>     use PDL;
>     $a = random(15000,15000); # generate 15000x15000 array of random numbers
>     $b = random(15000,15000); # generate another one
>
> If you're running out of memory you may be trying to do something silly
> like read all the numbers in as Perl scalars...?
>
> On the other hand, this may take a while:
>
>     $c = $a x $b; # brute-force matrix multiply -- ~200 hours to complete
>
> The reason is that the final step requires (8 * 15000 * 3 * 15000 * 15000)
> memory accesses.
>
> Finding eigenvalues of a 15000x15000 matrix is a nontrivial process. PDL
> has an eigenvalue solver ("eigens"), but it is a general purpose tool for
> small matrices; it would take considerably longer than the age of the
> Universe to find the eigenvalues of a 15000x15000 nonsparse matrix -- so
> your project might be a little late if you use that.
>
> Working with large matrices is its own computational subject. PDL makes a
> nice framework for it, but for any serious operations you can't just use
> the kind of general purpose tools that work fine on (say) a 10x10 matrix.
>
>
> On Sep 5, 2014, at 9:12 AM, Ronak Agrawal <[email protected]> wrote:
>
> Thank you, Sir, for the early response.
>
> I am new to Perl and have been assigned a project on Topic Modeling where
> I have to search, browse and find information from large archives of
> texts.
>
> Matrix operations are one part of it and, as per the requirements, my
> matrix may be sparse or dense. Is it possible for you to help me with both
> cases?
>
> In addition, can you tell me some good methods to handle large data in
> Perl?
>
> Once again, thank you for the response.
>
>
> On Fri, Sep 5, 2014 at 7:36 PM, Craig DeForest <[email protected]>
> wrote:
>
>> Glad to help. First, a few questions. Is the matrix sparse? (i.e. are
>> less than, say, 10^-3 of the elements nonzero?) How close to tridiagonal
>> is it?
>>
>>
>> On Sep 5, 2014, at 6:27 AM, Ronak Agrawal <[email protected]> wrote:
>>
>> *Hi*
>>
>> *I am doing a project in Topic Modelling which involves large matrix
>> operations.*
>>
>> *I have a SQL database from which I have to generate a 15000 x 15000
>> matrix, transform it and obtain A'A. Later I have to find the eigenvalues
>> and eigenvectors.*
>>
>> *Can you suggest ways to do this in Perl? I get "Out of Memory" while
>> storing the matrix in memory.*
>>
>> *Your input will help in handling big data and thereby making my project
>> a success.*
>>
>> Thank You
>>
>> Ronak
>>
>> _______________________________________________
>> Perldl mailing list
>> [email protected]
>> http://mailman.jach.hawaii.edu/mailman/listinfo/perldl

-- 
"Debugging is twice as hard as writing the code in the first place.
 Therefore, if you write the code as cleverly as possible, you are,
 by definition, not smart enough to debug it." -- Brian Kernighan
