[FOSSology] technical questions FOSSology
dear FOSSology people,

I have some technical questions regarding FOSSology. It basically comes down to the following:

Is there a mechanism that given a string I can let FOSSology search a knowledgebase, and it can come up with a list of possible packages where this string can be found in?

An example, say I want to search for the following string:

    This software is derived from the GNU GPL XviD codec

and I have populated the FOSSology database with (at least) a copy of XviD sources, will it be possible that it will let me know in which file this can be found?

If not, is this functionality planned?

armijn

-- 
Armijn Hemel, MSc
Loco (Loohuis Consulting)
Training, consultancy, hosting, and building
Specialized in Open Source solutions
http://www.loohuis-consulting.nl/
___
fossology mailing list
fossology@fossology.org
http://fossology.org/mailman/listinfo/fossology
Re: [FOSSology] technical questions FOSSology
On Aug 18, 2009, at 9:56 AM, Armijn Hemel wrote:

> dear FOSSology people,
>
> I have some technical questions regarding FOSSology. It basically comes down to the following:
>
> Is there a mechanism that given a string I can let FOSSology search a knowledgebase, and it can come up with a list of possible packages where this string can be found in?
>
> An example, say I want to search for the following string:
>
>     This software is derived from the GNU GPL XviD codec
>
> and I have populated the FOSSology database with (at least) a copy of XviD sources, will it be possible that it will let me know in which file this can be found?
>
> If not, is this functionality planned?

There is a way to do this in 1.1 (latest stable release) by defining license terms. From the top menu it is in Organize > License > Manage Terms. However, this is being deprecated in the next release.

In the next release, it will be easy to add your own license and you could simply add "This software is derived from the GNU GPL XviD codec" as a license. This will be easy to do in 1.2.

Another feature that I've been wanting to get to is an ad hoc string search, but I'm not sure how useful it really is, since the results won't be stored in the db.

So do you want ad hoc string searches or an easy way to add a phrase, like the one you quoted, as a new license to look for?

Bob Gobeille
b...@fossology.org
Re: [FOSSology] technical questions FOSSology
hello Bob, others,

>> No, it would not be a new license, it would be random strings, like print or fprintf statements from programs. Tools like Blackduck do this. I'm OK with it if it is on the fly, but it would be a lot easier to have it in FOSSology than having to rely on things like Google code search, which require a web interface (at least, I think they do) and which require sending requests to Google.
>
> Hi Armijn,
>
> It's been on our to-do list http://fossology.org/task_list for some time at a low priority because we have had no user requests (until yours) for it. It's on the list because I thought it would be useful. ;-) If I get some time I'd like to play around with what it would take to implement this.
>
> Are you only interested in English strings?

Well, I would be interested in random UTF-8 strings that are likely to appear in source code :-) These could be:

* function names
* printf/fprintf statements
* other strings that are in source code and that have been copied into a binary

I don't see why that should be restricted to just one language. It would be a lot harder to implement than plain UTF-8 searches.

So, basically it comes down to full text searches. I'm not sure how well PostgreSQL can handle these (I think it can).

armijn

-- 
Armijn Hemel, MSc
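As an illustration of the kind of strings listed above (function names and printf format strings that end up in a binary), here is a minimal Python sketch that pulls runs of printable characters out of raw bytes, in the spirit of the Unix strings tool. It is not part of FOSSology; the function name `extract_strings` and the `min_len` parameter are made up for this example, and it only handles printable ASCII, so it is a simplification of the UTF-8 case discussed in the message.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII bytes found in data.

    Hypothetical helper for illustration only; multi-byte UTF-8 sequences
    are not recognised and simply end the current run.
    """
    # 0x20 (space) through 0x7e (~) are the printable ASCII characters.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.group().decode("ascii") for match in pattern.finditer(data)]


if __name__ == "__main__":
    # A made-up blob standing in for a compiled binary.
    blob = b"\x7fELF\x00\x01\x02Usage: %s [options]\n\x00\x03xvid_encore\x00\xff\xfeab\x00"
    for s in extract_strings(blob):
        print(s)
```

Strings extracted this way could then be used as search terms against whatever index holds the source files.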
Re: [FOSSology] technical questions FOSSology
On Aug 18, 2009, at 12:58 PM, Armijn Hemel wrote:

>> Are you only interested in English strings?
>
> Well, I would be interested in random UTF-8 strings that are likely to appear in source code :-) These could be:
>
> * function names
> * printf/fprintf statements
> * other strings that are in source code and that have been copied into a binary
>
> I don't see why that should be restricted to just one language. It would be a lot harder to implement than plain UTF-8 searches.
>
> So, basically it comes down to full text searches. I'm not sure how well PostgreSQL can handle these (I think it can).

The dictionary, stemmer and stopwords are language specific. Postgres can handle all of this.

Just a plain full text search will find function names, text in printfs, and strings in binaries, because everything looks like a string (in my idea of a simple initial implementation). In other words, the search won't distinguish between a function name and a string found elsewhere. They would all be treated as strings without any other meaning.

My thought for the first implementation was something simple:

1) scan each file loaded into the repository and create a full text index based on English strings.
2) user can then do an ad hoc search for a string.

So searching on Xvid would find files containing functions named xvid, printf strings with XviD, XVID in language files, ... The search would return a list of files and some bytes around the string (something like grep). This simple implementation is much like a fast grep with stemming, stopwords, and thesaurus.

Of course, lots of things could make this better. For example, dumping symbol tables and saving them as symbols in the string index (this would be easy for an agent to do). Or parsing code (programming language specific) to extract function names, etc. That way you could limit your search to function names, class names, symbol names, etc.

The next step would probably be to load in files of strings as symbols, text, function names, etc. and use that as a reference file, like we do for licenses. The search would find files based on that whole lot of criteria and return a ranked list of hits.

Bob Gobeille
b...@fossology.org
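The simple first implementation described above (index every file, then search ad hoc and show some bytes around each hit) can be sketched roughly as follows. This is a minimal in-memory sketch in Python, not FOSSology code: the `StringIndex` class and the `tokenize` helper are hypothetical names, a plain dictionary stands in for a PostgreSQL full text index, and stemming, stopwords and a thesaurus are left out. Matching is case-insensitive on ASCII word tokens only.

```python
import re
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Tokens are runs of ASCII letters and digits; "xvid_encore" yields "xvid" and "encore".
TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text: str) -> List[str]:
    """Split text into lowercase tokens (no stemming, no stopword removal)."""
    return [t.lower() for t in TOKEN.findall(text)]


class StringIndex:
    """Hypothetical in-memory stand-in for a full text index over a repository."""

    def __init__(self) -> None:
        self.files: Dict[str, str] = {}                        # path -> decoded text
        self.postings: Dict[str, Set[str]] = defaultdict(set)  # token -> paths

    def add(self, path: str, data: bytes) -> None:
        """Step 1: scan a file and record which tokens it contains."""
        text = data.decode("utf-8", errors="replace")
        self.files[path] = text
        for token in tokenize(text):
            self.postings[token].add(path)

    def search(self, query: str, context: int = 20) -> List[Tuple[str, str]]:
        """Step 2: return (path, snippet) for files containing every query token."""
        tokens = tokenize(query)
        if not tokens:
            return []
        matching = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        hits = []
        for path in sorted(matching):
            text = self.files[path]
            # Snippet: some characters around the first case-insensitive
            # occurrence of the first query token, a bit like grep output.
            m = re.search(re.escape(tokens[0]), text, re.IGNORECASE)
            start = max(0, m.start() - context)
            hits.append((path, text[start:m.end() + context]))
        return hits


if __name__ == "__main__":
    index = StringIndex()
    index.add("xvid.c", b'int xvid_encore(void) { printf("XviD codec ready\\n"); }')
    index.add("lang.txt", b"XVID video settings")
    index.add("other.c", b"int main(void) { return 0; }")
    for path, snippet in index.search("Xvid"):
        print(path, repr(snippet))
```

Because every token is treated as a plain string, a function name and a printf string are indistinguishable here, which matches the "everything looks like a string" idea. Restricting a search to symbols or function names would need an extra field per posting, and ranking would need hit counts; neither is attempted in this sketch.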