Hello data.table gurus, I have been using data.table to efficiently work with textual data and I love it for that purpose. I have transformed my data so that it looks something like this:
word         document  position
I                   1         1
have                1         2
transformed         1         3
my                  1         4
data                1         5
so                  2         1
that                2         2
it                  2         3
looks               2         4
something           2         5
like                2         6
this                2         7

(I actually use a unique number for each word, so that I can use data.table's excellent features to do lightning-fast word counts. This has revolutionized my workflow compared with looping through text files with Perl.)

My problem is that I sometimes need to search for phrases or to select words based on their context (for instance, I may want to exclude a word if it is preceded by "not" or followed by a word that changes its meaning). Currently, I am using the solution here <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to create a new column holding the word at another position, like this:

word         document  position  lead_word
I                   1         1  have
have                1         2  transformed
transformed         1         3  my
my                  1         4  data
data                1         5  NA
so                  2         1  that
that                2         2  it
it                  2         3  looks
looks               2         4  something
something           2         5  like
like                2         6  this
this                2         7  NA

using a command like:

    DT[, lead_word := DT[list(document, position + 1), word]]

This approach has two problems, however. First, it consumes more resources as the dataset grows. I am currently working with a file containing over 150 million rows, so adding a column is costly. Second, I may want to check both one and two words ahead, which means adding two columns, and this can quickly get out of hand.

Is there a better way to use data.table to check the value in a row a distance N from the row of interest within a group, and to select a row based on that value? Perhaps the .I variable could be useful here?

I appreciate any suggestions.

Regards,
Matt
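
P.S. In case a reproducible example helps, here is a minimal sketch of the setup described above. It is only an illustration: the toy table stores the words themselves rather than the numeric ids I use in practice, it assumes the table is keyed on (document, position), and the final filter on "that" is a made-up example of selecting by context.

```r
library(data.table)

# Toy data mirroring the example table (words as text, not numeric ids)
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# Assumption: the table is keyed on (document, position) so the self-join works
setkey(DT, document, position)

# Self-join: for each row, look up the word at position + 1 in the same document.
# Rows with no following word in the document (the last word) get NA.
DT[, lead_word := DT[list(document, position + 1L), word]]

# Illustrative selection by context: rows whose next word is "that"
hits <- DT[lead_word == "that"]
print(DT)
print(hits)
```

Checking two words ahead with this approach means repeating the self-join with position + 2L into yet another column, which is the cost I would like to avoid.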
