Hi Matt,

Great. If you can prepare some dummy data with the appropriate properties and a parameter or two to scale up the size (or just provide an online large example to download) and a query that gets to the right answer but is slow or ugly, then we've got something to chew on ...

Matt

On 28/06/14 10:55, Matthew DeAngelis wrote:
Hi Matt,

You have the right of it. The problem is somewhat complicated, however, since I would want to substitute "DT[word=="good"..." with "DT[J("good")..." after setting the key to word and reordering the rows. Hence the two-step process I have now: I key by document and position first, create the lag_word column, then key by the word and lag_word columns and query by row.
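
For concreteness, the two-step process could be sketched roughly as below. This is only an illustration under stated assumptions: the toy data and the result names not_good and all_good are made up, and shift() is assumed to be available in the installed data.table version (the original approach builds the column with a self-join instead).

```r
library(data.table)

# Toy data (hypothetical), one row per word occurrence.
DT <- data.table(
  word     = c("this", "is", "not", "good", "that", "is", "very", "good"),
  document = c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L),
  position = c(1:4, 1:4)
)

# Step 1: order by document and position, then store the preceding word.
setkey(DT, document, position)
DT[, lag_word := shift(word, 1L, type = "lag"), by = document]

# Step 2: re-key on the columns to search, then query with a keyed join.
setkey(DT, word, lag_word)
not_good <- DT[J("good", "not"), nomatch = 0L]  # "good" preceded by "not"
all_good <- DT[J("good"), nomatch = 0L]         # every occurrence of "good"
```

The second setkey() physically reorders the rows, which is why the lagged column has to be created before it.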


Matt


On Fri, Jun 27, 2014 at 3:17 PM, Matt Dowle wrote:


    Hi,

    Not sure exactly what you need but looks interesting.

    Something a bit like this ?

    DT[ word == "good", .SD[ lag(word, N) != "not" ], by=document]

    Your idea being you don't want to have to repeat all the pre and
    post words alongside each word but rather express it in the query.
    Makes sense.   Leads to classifying "not good" and "not very good"
    as both negative phrases I guess.
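
One way the query above might be written so that it runs, as a sketch only: it assumes shift() is available in the installed data.table version, uses made-up toy data, and treats N as a hypothetical offset parameter. The test on the preceding word is done inside each document before subsetting, so the lag is taken over the full word sequence rather than over the "good" rows alone.

```r
library(data.table)

# Toy data (hypothetical), one row per word occurrence.
DT <- data.table(
  word     = c("this", "is", "not", "good", "that", "is", "very", "good"),
  document = rep(1:2, each = 4),
  position = rep(1:4, times = 2)
)
setkey(DT, document, position)

N <- 1L  # how many words back to look (assumed parameter)

# Keep "good" only where the word N positions earlier in the same document
# is not "not". %in% treats the NA at the start of a document as "no match".
good_not_negated <- DT[
  , .SD[word == "good" & !(shift(word, N, type = "lag") %in% "not")],
  by = document
]
```

No extra column is stored here, though the lagged vector is still built temporarily for each group.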

    Matt



    On 26/06/14 21:56, Matthew DeAngelis wrote:
    Hello data.table gurus,

    I have been using data.table to efficiently work with textual
    data and I love it for that purpose. I have transformed my data
    so that it looks something like this:

    word         document  position
    I            1         1
    have         1         2
    transformed  1         3
    my           1         4
    data         1         5
    so           2         1
    that         2         2
    it           2         3
    looks        2         4
    something    2         5
    like         2         6
    this         2         7


    (I actually use a unique number for each word, so that I am able
    to use data.table's excellent features to do lightning-fast word
    counts. This has revolutionized my workflow over looping through
    text files with Perl.)

    My problem is that I sometimes need to search for phrases or to
    select words based on their context (for instance, I may want to
    exclude a word if it is preceded by "not" or followed by a word
    that changes its meaning). Currently, I am using the solution
    here
    <http://stackoverflow.com/questions/11397771/r-data-table-grouping-for-lagged-regression>
    to create a new column for a word in another position, like this:

    word         document  position  lead_word
    I            1         1         have
    have         1         2         transformed
    transformed  1         3         my
    my           1         4         data
    data         1         5         NA
    so           2         1         that
    that         2         2         it
    it           2         3         looks
    looks        2         4         something
    something    2         5         like
    like         2         6         this
    this         2         7         NA


    using a command like this (with the key set to document and position):
    DT[, lead_word := DT[list(document, position + 1), word]]

    This approach has two problems, however. First, it consumes more
    resources as the dataset grows. I am currently working with a
    file containing over 150 million rows, so adding a column is
    costly. Second, I may want to check both one and two words ahead,
    so that I have to add two columns, and this can quickly get out
    of hand.

    Is there a better way to use data.table to check the value in a
    row N distance from the row of interest within a group and select
    a row based on that value? Perhaps the .I variable could be
    useful here?
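
A possible row-index approach along these lines, offered as a sketch rather than a definitive answer: because the table is sorted by document and position, the word N rows above a hit is the word N positions earlier whenever both rows belong to the same document. The toy data and the names hits, prev, same_doc, negated and result are illustrative assumptions.

```r
library(data.table)

# Toy data (hypothetical), one row per word occurrence.
DT <- data.table(
  word     = c("this", "is", "not", "good", "that", "is", "very", "good"),
  document = rep(1:2, each = 4),
  position = rep(1:4, times = 2)
)
setkey(DT, document, position)  # rows must be in reading order

N <- 1L                               # offset to inspect (assumed parameter)
hits <- DT[, which(word == "good")]   # row numbers of the word of interest
prev <- hits - N                      # candidate rows N words earlier

# Guard against running off the top of the table or into another document.
safe_prev <- pmax(prev, 1L)
same_doc  <- prev >= 1L & DT$document[safe_prev] == DT$document[hits]
negated   <- same_doc & DT$word[safe_prev] == "not"

result <- DT[hits[!negated]]
```

Only vectors the length of the hit list are created, so nothing proportional to the full table is added; checking two words back as well would mean repeating the test with N <- 2L.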

    I appreciate any suggestions.


    Regards,
    Matt


    _______________________________________________
    datatable-help mailing list
    [email protected]
    https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help


