[datatable-help] Efficiently checking value of other row in data.table

Matthew DeAngelis Thu, 26 Jun 2014 13:57:04 -0700

Hello data.table gurus,

I have been using data.table to efficiently work with textual data and I
love it for that purpose. I have transformed my data so that it looks
something like this:


word         document  position
I                   1         1
have                1         2
transformed         1         3
my                  1         4
data                1         5
so                  2         1
that                2         2
it                  2         3
looks               2         4
something           2         5
like                2         6
this                2         7

(I actually use a unique number for each word, so that I am able to use
data.table's excellent features to do lightning-fast word counts. This has
revolutionized my workflow over looping through text files with Perl.)

My problem is that I sometimes need to search for phrases or to select
words based on their context (for instance, I may want to exclude a word if
it is preceded by "not" or followed by a word that changes its meaning).
Currently, I am using the solution here
<http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression>
to create a new column for a word in another position, like this:

word         document  position  lead_word
I                   1         1  have
have                1         2  transformed
transformed         1         3  my
my                  1         4  data
data                1         5  NA
so                  2         1  that
that                2         2  it
it                  2         3  looks
looks               2         4  something
something           2         5  like
like                2         6  this
this                2         7  NA

using a command like: DT[, lead_word := DT[list(document, position + 1), word]].
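
For anyone who wants to reproduce this, here is a minimal, self-contained sketch of that approach on the toy data above. The words are kept as strings for readability (my real data uses integer ids), and the setkey() call is an assumption about the setup: the join only works if DT is keyed on document and position.

```r
library(data.table)

# Toy data matching the example tables above (strings instead of word ids)
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# The lookup below is a keyed join, so DT must be keyed on (document, position)
setkey(DT, document, position)

# For every row, look up the word at position + 1 in the same document;
# rows without a match (the last word of each document) get NA
DT[, lead_word := DT[list(document, position + 1L), word]]
```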

This approach has two problems, however. First, it consumes more resources
as the dataset grows. I am currently working with a file containing over
150 million rows, so adding a column is costly. Second, I may want to check
both one and two words ahead, so that I have to add two columns, and this
can quickly get out of hand.

Is there a better way to use data.table to check the value in a row N
distance from the row of interest within a group and select a row based on
that value? Perhaps the .I variable could be useful here?
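
To make the question concrete, here is a rough sketch of the kind of selection I am after, done without storing a new column. It is only one possible direction, untested at the scale mentioned above: it finds candidate rows first and then looks up the following word for those rows alone. The names idx, nxt and hits are just placeholders for illustration, and the target words "looks" and "something" are arbitrary.

```r
library(data.table)

# Toy data as in the first table (strings instead of word ids)
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)
setkey(DT, document, position)

# Row numbers of the candidate word
idx <- DT[, which(word == "looks")]

# Keyed join for the candidates only: the word at position + 1 in the same
# document; NA when there is no following word
nxt <- DT[list(DT$document[idx], DT$position[idx] + 1L), word]

# Keep only the candidates whose following word matches
hits <- DT[idx[!is.na(nxt) & nxt == "something"]]
```

The temporary vectors are only as long as the number of candidate rows, not the full table, which is the property I am hoping a better solution would have.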

I appreciate any suggestions.


Regards,
Matt

