Re: Gdata - Indexing feeds and entries

Otis Gospodnetic Wed, 19 Jul 2006 22:06:50 -0700

Hi Simon,

I'm not sure if I already replied to this or not.
Here are some thoughts.
Distributed indexing:
- you could take the Solr approach and have a Master indexing server that 
periodically takes snapshots and tells Slave servers "hey, come get the new 
stuff".  The problem is that the Master is a single point of failure.
- you could take a similar replication approach with DRBD 
(http://www.drbd.org/) or some such
- you could accept new entries in one place but delegate the indexing to 
multiple instances of the GData server in parallel
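
To make the first option a bit more concrete, here is a minimal in-memory 
sketch of snapshot-based replication.  It is not how Solr implements it; 
the class names (Master, Slave), the dict-based "index" and the whitespace 
term matching are all assumptions made up for illustration.  In a real setup 
a snapshot would be a copy of the index directory that gets shipped over the 
network.

```python
class Master:
    """Accepts all writes and publishes immutable, numbered snapshots."""

    def __init__(self):
        self._live = {}       # entry id -> text; the index being written to
        self.snapshots = {}   # version -> frozen copy of the live index
        self.version = 0      # 0 means "no snapshot taken yet"

    def index(self, entry_id, text):
        self._live[entry_id] = text

    def delete(self, entry_id):
        self._live.pop(entry_id, None)

    def take_snapshot(self):
        # Hypothetical stand-in for copying the index directory.
        self.version += 1
        self.snapshots[self.version] = dict(self._live)
        return self.version


class Slave:
    """Serves searches from the last snapshot it pulled from the master."""

    def __init__(self, master):
        self.master = master
        self.version = 0
        self.index = {}

    def pull(self):
        """Fetch the newest snapshot; return True if anything changed."""
        latest = self.master.version
        if latest == self.version:
            return False
        self.index = dict(self.master.snapshots[latest])
        self.version = latest
        return True

    def search(self, term):
        # Naive whitespace term match, only to show what a slave can see.
        return sorted(
            entry_id
            for entry_id, text in self.index.items()
            if term in text.split()
        )
```

Slaves only ever see whole snapshots, so searches stay consistent, but new 
entries are invisible until the next snapshot is taken and pulled, and all 
writes still depend on the one Master.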


As for searching, you could simply partition the traffic instead of 
partitioning the index.  Not the same thing clearly, but it's probably simpler 
to do (throw a load balancer/proxy in front of the search servers).  If you 
want to partition the index, you could simply employ some logic that specifies 
the maximum size of the index.  Until that limit is reached you index to the 
current index.  Once the limit is reached you start a new index, possibly on a 
new server if that is available, or you start a new index and migrate the 
closed index elsewhere.
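
The size-limit logic could look roughly like the sketch below.  Everything 
in it is an assumption for illustration: the RollingIndex name, measuring 
"size" as a number of entries rather than bytes, and plain dicts standing in 
for real indexes.  It only covers adding new entries; updates and deletes of 
entries that live in an already closed index would need extra handling.

```python
class RollingIndex:
    """Writes to one current partition and rolls over when it is full."""

    def __init__(self, max_docs):
        if max_docs < 1:
            raise ValueError("max_docs must be at least 1")
        self.max_docs = max_docs
        self.closed = []    # full partitions; candidates for migration
        self.current = {}   # entry id -> text; the partition being written

    def add(self, entry_id, text):
        # Once the limit is reached, close the current index and start anew.
        if len(self.current) >= self.max_docs:
            self.closed.append(self.current)
            self.current = {}
        self.current[entry_id] = text

    def partitions(self):
        return self.closed + [self.current]

    def search(self, term):
        # A search has to visit every partition and merge the hits.
        hits = []
        for partition in self.partitions():
            hits.extend(
                entry_id
                for entry_id, text in partition.items()
                if term in text.split()
            )
        return sorted(hits)


idx = RollingIndex(max_docs=2)
for i in range(5):
    idx.add(f"e{i}", f"entry number{i} gdata")
```

With a limit of 2 and five entries this ends up with two closed partitions 
and one current partition holding the fifth entry.  The closed ones never 
change again, which is what makes them easy to move to another server.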

I imagine Yonik, Doug, and others will have other ideas, too.

Otis

----- Original Message ----
From: Simon Willnauer <[EMAIL PROTECTED]>
To: [email protected]
Sent: Saturday, July 15, 2006 10:37:11 AM
Subject: Gdata - Indexing feeds and entries

Hi there,

it has been quiet around Gdata for the last 2 weeks, but all the exams are
done and uni finished yesterday, so the next round can start.
OK, what needs to be done: the gdata protocol describes a kind of
query language for querying feeds with full-text search in defined XML
elements and / or custom elements. For that purpose, stored,
updated and deleted entries have to be reflected in the search
component to be available for searching. The indexer component of the
server has to be notified about modification events to keep the index
up to date.
I'm not yet sure how the fields / elements of the XML will be
configured, but I guess I will look for some ideas in Solr or Nutch and
discuss that later.
My first and main problem is pretty well known on this mailing list.
I found lots of questions and suggestions via Google, but these
discussions were quite a while ago. I was wondering if there are any
new insights about distributed searching / indexing. The server
should be able to run in clusters / server farms, so indexed data must
be available on each server / machine. I thought about this for a
while and all my ideas seem to be problematic in some way.
I found this thread on the mailing list:
http://www.mail-archive.com/[email protected]/msg12700.html

which gives a lot of information about the problem I'm facing.

It would be great if some of you experienced guys could share your
experience with / solutions to this problem. If you
see any possibility of providing such a mechanism as a generic solution,
we could separate it out as a new contrib project after SoC has
finished, i.e. detach it from gdata.

thanks in advance for your help ;)


Simon





