Hi all,

wouldn't it be easier to let mkgmap report the words that appear in more than n (e.g. 20) road names, and use that report to build a user-defined list of stop-words?
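
To illustrate the idea, here is a minimal sketch in Python. It is not mkgmap code (mkgmap is written in Java), and the function name `frequent_words` and its `min_roads` parameter are made up for this example. It counts each word once per road name and reports the words found in more than `min_roads` names.

```python
from collections import Counter


def frequent_words(road_names, min_roads=20):
    """Return the words that occur in more than min_roads road names.

    Hypothetical helper for illustration only; not part of mkgmap.
    """
    counts = Counter()
    for name in road_names:
        # Use a set so a word repeated inside one name is counted once,
        # and lower-case it so "Rue" and "rue" are the same word.
        counts.update(set(name.lower().split()))
    return sorted(word for word, n in counts.items() if n > min_roads)
```

The resulting list would still need a manual review before being used as stop-words, since a frequent word is not necessarily a generic street term.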
Gerd

> From: [email protected]
> Date: Sat, 14 Feb 2015 15:06:16 +0100
> To: [email protected]
> Subject: Re: [mkgmap-dev] mixed index branch merge
>
> Hi all,
>
> In French, off the top of my head, I can think of:
>
> Rue, Ruelle, Avenue, Boulevard, Quai, Chaussée, Route, Cour, Cours, Cité,
> Chemin, Place, Esplanade, Passage, Allée, Carrefour, Sentier, Square, Villa.
>
> This list is without a doubt not complete but should cover more than 95% of
> named addresses in France.
>
> They should only be left out of the index if they come first and are
> followed by something else.
>
> Cheers,
> Paco
>
> On 14 Feb 2015 at 08:50, Marko Mäkelä <[email protected]> wrote:
>
> > On Thu, Feb 12, 2015 at 01:24:29PM +0000, Steve Ratcliffe wrote:
> >> So finally I will merge the mixed index branch.
> >
> > I believe that the database terminology for this is 'inverted index' or
> > 'fulltext index'.
> >
> >> I think it would be best to selectively enable it per country along with
> >> lists of names to avoid. This would be best done by people from or
> >> familiar with the countries in question.
> >
> > In fulltext search, these are called 'stopwords'.
> >
> > It might not be necessary to do anything for countries where street
> > names are commonly written as a single word. Example: "Main Street" would
> > be "Hauptstrasse" in German, "Huvudgatan" in Sweden and "Päätie" in
> > Finnish. Only if the first part of the street name is a proper name such as
> > a person's name, the second part could be written as a separate word,
> > separated by a space or dash.
> >
> > That said, I guess it would still make sense to introduce some stopwords.
> > Words that I can think of:
> >
> > Swedish: gata, gatan, gränd, gränden, stig, stigen, (stråk, stråket)
> > Finnish: tie, katu, polku, kuja, (raitti, taival)
> > German: Straße, Strasse, Weg, Allee, Chaussee
> > Estonian: mnt, maantee, tn, tänav, pst, puiestee
> >
> > In Estonia, it seems to be common to write the tn, mnt or pst as a separate
> > word.
> >
> > I could be missing some stopwords in Estonian and for German-speaking
> > countries. Also, it could be that the French loan words Allee and Chaussee
> > are sometimes accented.
> >
> > The Finnish and Swedish words that I have put in parentheses should be very
> > rare, typically used for ways for non-motorized traffic. I don't think
> > that including them would pollute the index much. You might in fact want to
> > search for such a name when you are looking for a nice walking or cycling
> > route (i.e., you expect there to exist some
> > random-famous-person-name-stråket, but you do not know the random name).
> >
> > Marko
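
Paco's rule (skip a stop-word only when it is the first word and something else follows it) could look roughly like the Python sketch below. This is only an illustration, not how mkgmap builds its index; the names `FRENCH_STOPWORDS` and `index_key` are hypothetical, and the word list is simply the one from Paco's mail.

```python
# Stop-words taken from Paco's list, lower-cased for comparison.
FRENCH_STOPWORDS = {
    "rue", "ruelle", "avenue", "boulevard", "quai", "chaussée", "route",
    "cour", "cours", "cité", "chemin", "place", "esplanade", "passage",
    "allée", "carrefour", "sentier", "square", "villa",
}


def index_key(name, stopwords=FRENCH_STOPWORDS):
    """Return the part of a street name that would be indexed.

    Hypothetical helper: the first word is dropped only if it is a
    stop-word AND at least one more word follows it.
    """
    words = name.split()
    if len(words) > 1 and words[0].lower() in stopwords:
        return " ".join(words[1:])
    # A lone stop-word, or a stop-word that is not in first position,
    # is kept as it is.
    return name
```

With this rule "Rue de la Paix" would be indexed under "de la Paix", while a street called just "Avenue" or a name like "Grande Rue" would stay unchanged.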
_______________________________________________ mkgmap-dev mailing list [email protected] http://www.mkgmap.org.uk/mailman/listinfo/mkgmap-dev
