A small variation of Vishal's algorithm to eliminate the need of the
bitmap. I use a hash table of integers index by the keyword,
initialized to all 0. I also make use of the property that in the
shortest summary the first and the last keyword will appear exactly
once.
1. foreach word in document
hash[word]++; // by the end of this loop we should have
frequencies of all keywords
2. Do a preliminary check to see if each hash[keyword] has frequency
>= 0. If at least one has frequency = 0, stop and return null to
indicate summary not found.
4. startp = pointer to first word, endp = pointer to last word
3. for (; startp <= endp; startp++)
if (hash[*startp] == 1) // stop when keyword with frequency =
1 is found
break;
else
hash[*startp]--; // by end of this loop, startp will point
to the first word of the summary, i.e. keyword that appears once
4. for (; startp <= endp; endp--)
if (hash[*endp] == 1) // stop when keyword with frequency = 1
is found
break;
else
hash[*endp]--; // by end of this loop, endp will point to
the last word of the summary, i.e. keyword that appears once
5. return summary which is from startp to endp
Worst case is O(2N) = O(N). Also, if there are more than one shortest
summaries, this will return the last one.
-Shrenik
On Sep 24, 10:15 pm, Vishal <[EMAIL PROTECTED]> wrote:
> How about keeping two pointers - startp and endp.
> Keep a count of frequencies of keywords between startp and endp, both
> inclusive. We can use an array / hash table for this.
> Now, the minimum length substring has to start with a keyword and has to end
> with a keyword.
>
> 1. Initially startp=0 and endp=1. L = infinity
> 2. Increment endp till you encounter a keyword or it exceeds the total
> length. Update frequencies. Check if you have all the keywords. (This can be
> done in O(1) using a bitmap or similar). If it exceeds the total length,
> QUIT.
> 3. If the str(startp,endp) contains all keywords and length < L, save values
> of startp and endp.
> 4. Now, increment startp, till you get a keyword. If the str(startp,endp)
> still contains all keywords, update saved values of startp and endp.
> 5. Repeat step 4 till str(startp,endp) does NOT contain all keywords.
> 6. Goto step 2.
>
> The final values of startp and endp should give you the minimum summary.
> Since both values go from 0 to N-1, its O(N).
>
> ~Vishal
>
> On 9/24/07, daizisheng <[EMAIL PROTECTED]> wrote:
>
>
>
> > I think hash method is ok, at lease in expectation way it's O(n)
> > why not use it? it's very effeciently
>
> > I think there should be some worst case O(n) algorithm, but it will be
> > more complex and not as effecient as the above one in practise
>
> > Sticker 写道:
> > > Declare: this question is originally from Google. The original form
> > > is: given a document, how to find a shortest summary containing all
> > > the key words. After translated, it will be: given a sequence, how to
> > > find a shortest substring that contains all the items required.
> > > Example: a sequence "abaccdefgacel" and a set of key words "a", "c",
> > > "d" and "e". The shortest substring contianing all of the key words is
> > > "accde". Find one of such shortest substrings. In the original
> > > question, there is time complexity boundary O(N), where N is the
> > > length of the sequence.
>
> > > To me, this problem is rather questionable. So far my solution
> > > requires a hash table that gives no conflict and can locate a key word
> > > in O(1) time. Otherwise the time complexity will definitely be related
> > > to the number of key words.
>
> > > Does anyone have idea for a O(N) algorithm to solve this problem?
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"Algorithm Geeks" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/algogeeks
-~----------~----~----~----~------~----~------~--~---