Hi R experts

I have the following regular expression problem. I am writing a basic corpus 
retrieval program, i.e. a concordancer/function where a user enters
- a set or a directory of text files to search;
- a regular expression to search for in these files.

I want to provide an output in which the matches of the regular expression are 
listed in one central column and the neighboring columns give the words before 
and after the matching word. For example, a concordance of the word "the" for 
the previous sentence with a user-defined span of 3 would look like this:
-3           -2       -1     0    1            2           3
output       in       which  the  matches      of          the
the          matches  of     the  regular      expression  are
central      column   and    the  neighboring  columns     give
neighboring  columns  give   the  words        before      and
before       and      after  the  matching     word        .
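
For the plain-text case, the kind of display above could be produced roughly as 
follows. This is only a minimal sketch: the function name kwic and its 
arguments are made up for illustration, the text is assumed to be split on 
spaces only, and the example uses just the first clause of the sentence above.

```r
# Hypothetical helper: one row per hit, columns -span ... span.
kwic <- function(tokens, pattern, span = 3) {
  hits <- grep(pattern, tokens, ignore.case = TRUE)   # token positions that match
  if (length(hits) == 0) return(NULL)
  # Pad both ends so that every hit has a full window around it.
  padded <- c(rep("", span), tokens, rep("", span))
  rows <- lapply(hits, function(i) padded[i:(i + 2 * span)])
  out <- do.call(rbind, rows)
  colnames(out) <- as.character(-span:span)
  out
}

txt <- "I want to provide an output in which the matches of the regular expression are listed in one central column"
tokens <- strsplit(txt, " +")[[1]]
res <- kwic(tokens, "^the$")
print(res)
```

Padding the token vector with empty strings keeps the window arithmetic the 
same for hits near the beginning or end of a line.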

As you can see, there may be multiple hits per line. This all works perfectly 
well in cases where the regular expression matches just one of the elements 
that are to be separated in the table. 'Unfortunately', apart from 'normal' 
text files, I also have text files in which every word is preceded by a tag 
giving its word class, for example

a<-c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
     "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")

Now, as long as the regular expression entered by the user is something like
   b<-"<w TO0>to"
or even
   b<-"(?Ui)<w VVN>[^<]*<"
this works fine: I identify hits using grep(b, a, perl=TRUE), split up the line 
using strsplit, and provide as many words before and after my search string as 
are necessary (and available in the line).

But if the regular expression entered by a user (when prompted by scan(nmax=1, 
what="char")) is
   b<-"(?Ui)(<w TO0>to <w VV.>[^<]*<)"
I run into several related problems. As you all know, grep and regexpr will 
only give me the first hit per line anyway - which is how I identified the 
lines in the first place - but for the desired output I need all the hits per 
line together with their context. But, obviously, when I split up the line 
using strsplit with "<w " as a separator, so that I can get all hits and all 
words for the columns -3 to -1 and 1 to 3, the expression matched by the search 
string b is also split up and can no longer be put into one tab-separated 
central column. I also don't seem to be able to extract all hits, store them 
and insert them again at a later stage ... Basically, I need to split up each 
element of the vector that contains at least one match into x parts, where x is 
the number of hits plus the number of elements that result when the surrounding 
material is split up, so that I can generate this kind of display (I leave 
aside the issue of spaces for now and transpose the above kind of display for 
expository reasons):

(the first hit in a[1])
-3      
-2      
-1      
0       <w TO0>to <w VV1>find
1       <w VVN>expected
2       <w TO0>to
3       <w VV2>skivvy

and the next line of the output would be the second hit in a[1]:

-3      <w TO0>to
-2      <w VV1>find
-1      <w VVN>expected
0       <w TO0>to <w VV2>skivvy
1       <w DT0>much
2       <c PUN>.
3       

and the next line would be the only hit in a[2]. The short question after this 
long intro is: is there any way of splitting up the elements containing 
matches in such a way?
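
To make the intended splitting concrete, here is a rough sketch of what I have 
in mind. It rests on several assumptions: it needs a version of R that provides 
gregexpr (which returns all matches per element, not just the first); the 
helper names split_tags and concord_tagged are made up; every unit is assumed 
to start with "<"; lines are assumed to contain no newline; and the pattern is 
a simplified variant of b that does not consume the "<" of the following tag.

```r
a <- c("<w TO0>to <w VV1>find <w VVN>expected <w TO0>to <w VV2>skivvy <w DT0>much <c PUN>.",
       "<w VVN>seen <w TO0>to <w VV3>kill <w DT0>many")
b <- "(?i)<w TO0>to <w VV.>[^<]*"   # simplified: stops before the next "<"

# Hypothetical helper: split a stretch of text into tag+word units.
split_tags <- function(s) {
  parts <- strsplit(gsub("<", "\n<", s, fixed = TRUE), "\n", fixed = TRUE)[[1]]
  parts <- gsub("^ +| +$", "", parts)   # trim surrounding spaces
  parts[nzchar(parts)]                   # drop empty pieces
}

# Hypothetical helper: one row per hit in a single line, columns -span ... span.
concord_tagged <- function(line, pattern, span = 3) {
  m <- gregexpr(pattern, line, perl = TRUE)[[1]]
  if (m[1] == -1) return(NULL)           # no hit in this line
  starts <- as.integer(m)
  ends <- starts + attr(m, "match.length") - 1
  rows <- lapply(seq_along(starts), function(i) {
    # The hit stays in one piece; only its surroundings are split.
    hit <- gsub("^ +| +$", "", substring(line, starts[i], ends[i]))
    left <- split_tags(substring(line, 1, starts[i] - 1))
    right <- split_tags(substring(line, ends[i] + 1, nchar(line)))
    left <- tail(c(rep("", span), left), span)   # last `span` units, padded
    right <- c(right, rep("", span))[1:span]     # first `span` units, padded
    c(left, hit, right)
  })
  out <- do.call(rbind, rows)
  colnames(out) <- as.character(-span:span)
  out
}

res1 <- concord_tagged(a[1], b)
res2 <- concord_tagged(a[2], b)
print(res1)
print(res2)
```

The point of the sketch is that the match positions are used to cut the line 
into "before", "hit" and "after", so the hit never passes through strsplit. The 
first row of res1 corresponds to the display given above for the first hit in 
a[1]. Overlapping hits are not handled.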

I use R 2.1.1 on a Windows XP Pro SP2 machine (with Perl 5.8.7 in case that 
matters for PCRE). Thanks,
STG

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html