Re: [datatable-help] Efficiently checking value of other row in data.table

Matt Dowle Fri, 27 Jun 2014 12:18:12 -0700

Hi,


Not sure exactly what you need but looks interesting.

Something a bit like this ?

DT[ word == "good", .SD[ lag(word, N) != "not" ],  by=document]

Your idea being you don't want to have to repeat all the pre and post words alongside each word but rather express it in the query. Makes sense. Leads to classifying "not good" and "not very good" as both negative phrases I guess.
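
A rough sketch of how that kind of query might look in runnable form. Assumptions: the `lag(word, N)` above reads as illustrative syntax, so this uses `shift()`, which was added in later data.table releases (1.9.6 onwards) and so postdates this thread. The toy sentences and the name `hits` are made up for illustration, and rows are assumed to be sorted by position within each document.

```r
library(data.table)

# Toy data in the same long format as the question (hypothetical sentences)
DT <- data.table(
  word     = c("this", "is", "not", "good",
               "it", "is", "very", "good",
               "not", "very", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L, 3L, 3L),
  position = c(1:4, 1:4, 1:3)
)
setkey(DT, document, position)  # sorts by document, then position

# Keep "good" only when neither of the two preceding words in the same
# document is "not". The lagged words are computed on the fly per group
# and are never stored as columns. %in% is used so that the NA padding
# at the start of a document compares as FALSE rather than NA.
hits <- DT[, .SD[word == "good" &
                 !(shift(word, 1L) %in% "not") &
                 !(shift(word, 2L) %in% "not")],
           by = document]
print(hits)
```

Here only the "good" in document 2 survives, since "not good" (document 1) and "not very good" (document 3) are both filtered out. Note the `word == "good"` test sits inside the grouped expression rather than in `i`, so the lag is taken over the full document and not over the already-filtered rows.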


Matt


On 26/06/14 21:56, Matthew DeAngelis wrote:

Hello data.table gurus,
I have been using data.table to efficiently work with textual data and I love it for that purpose. I have transformed my data so that it looks something like this:
word    document        position
I       1       1
have    1       2
transformed     1       3
my      1       4
data    1       5
so      2       1
that    2       2
it      2       3
looks   2       4
something       2       5
like    2       6
this    2       7
(I actually use a unique number for each word, so that I am able to use data.table's excellent features to do lightning-fast word counts. This has revolutionized my workflow over looping through text files with Perl.)
My problem is that I sometimes need to search for phrases or to select words based on their context (for instance, I may want to exclude a word if it is preceded by "not" or followed by a word that changes its meaning). Currently, I am using the solution here <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression> to create a new column for a word in another position, like this:
word    document        position        lead_word
I       1       1       have
have    1       2       transformed
transformed     1       3       my
my      1       4       data
data    1       5       NA
so      2       1       that
that    2       2       it
it      2       3       looks
looks   2       4       something
something       2       5       like
like    2       6       this
this    2       7       NA


using a command like: DT[,lead_word:=DT[list(document,position+1),word]].
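
For reference, a self-contained version of that keyed self-join on the sample data above. This is a sketch that assumes the table is keyed on (document, position), which the command relies on but the post does not show.

```r
library(data.table)

# The sample data from the post
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# Assumption: the join below needs DT keyed on (document, position)
setkey(DT, document, position)

# For every row, look up the row with the same document and position + 1.
# Rows with no match (the last word of each document) get NA.
DT[, lead_word := DT[list(document, position + 1L), word]]
print(DT)
```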
This approach has two problems, however. First, it consumes more resources as the dataset grows. I am currently working with a file containing over 150 million rows, so adding a column is costly. Second, I may want to check both one and two words ahead, so that I have to add two columns, and this can quickly get out of hand.
Is there a better way to use data.table to check the value in a row N distance from the row of interest within a group and select a row based on that value? Perhaps the .I variable could be useful here?
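
One possible shape for the `.I` idea, sketched under assumptions: it relies on `shift()` from later data.table releases (1.9.6 onwards), the helper name `rows_without_following` and its arguments are hypothetical, and rows are assumed to be ordered by position within each document. It returns row numbers only, so no extra column is added to the table.

```r
library(data.table)

# The sample data from the post
DT <- data.table(
  word     = c("I", "have", "transformed", "my", "data",
               "so", "that", "it", "looks", "something", "like", "this"),
  document = c(rep(1L, 5), rep(2L, 7)),
  position = c(1:5, 1:7)
)

# Hypothetical helper: row numbers of `target` for which the word `n`
# positions ahead in the same document is not one of `exclude`.
# .I holds each group's row locations in the full table, so the result
# can be used directly as DT[idx].
rows_without_following <- function(dt, target, exclude, n = 1L) {
  dt[, .I[word == target &
          !(shift(word, n, type = "lead") %in% exclude)],
     by = document]$V1
}

# "it" is followed one word later by "looks" and two words later by
# "something", so excluding "looks" at distance 2 keeps the row.
idx <- rows_without_following(DT, "it", "looks", n = 2L)
print(DT[idx])
```

Checking both one and two words ahead would then be a matter of intersecting the row numbers from two calls, rather than adding two columns.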
I appreciate any suggestions.


Regards,
Matt


_______________________________________________
datatable-help mailing list
[email protected]
https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

