Our project requires near-real time searches and constant updating. The
data is currently stored in a MySQL database and the Lucene index is
updated as the database is modified
We have the search capability currently where we want it. However, we
are attempting to add the ability to "tag" documents in the
index/database. Since the data pots can be millions of records, we
don't want to update the lucene index for tagging, but instead have a
table of document IDs that we would like to be using to determine the
tag sets.
The best option I've so far found is to retrieve both list of IDs as an
integer array, sort them (so I only need one loop through), then loop
through and look for matches between the two. Attempting to use the
list of Lucene IDs in an "In" query fails because the number of
documents can be in the millions and MySQL chokes on it.
Any insight into how we could optimize this or do it?
Thanks in advance
Trevor Watson