[FOSSology] technical questions FOSSology

2009-08-18 Thread Armijn Hemel
dear FOSSology people,

I have some technical questions regarding FOSSology. It basically comes
down to the following:

Is there a mechanism by which, given a string, I can let FOSSology search
a knowledge base and come up with a list of possible packages in which
this string can be found?

For example, say I want to search for the following string:

This software is derived from the GNU GPL XviD codec

and I have populated the FOSSology database with (at least) a copy of
the XviD sources. Will FOSSology be able to tell me in which file this
string can be found?

If not, is this functionality planned?

armijn

-- 
Armijn Hemel, MSc

Loco (Loohuis Consulting)
Training, consultancy, hosting, and building
Specialized in Open Source solutions

http://www.loohuis-consulting.nl/

___
fossology mailing list
fossology@fossology.org
http://fossology.org/mailman/listinfo/fossology


Re: [FOSSology] technical questions FOSSology

2009-08-18 Thread Bob Gobeille


On Aug 18, 2009, at 9:56 AM, Armijn Hemel wrote:


dear FOSSology people,

I have some technical questions regarding FOSSology. It basically comes
down to the following:

Is there a mechanism that given a string I can let FOSSology search a
knowledgebase, and it can come up with a list of possible packages where
this string can be found in?

An example, say I want to search for the following string:

This software is derived from the GNU GPL XviD codec

and I have populated the FOSSology database with (at least) a copy of
XviD sources, will it be possible that it will let me know in which file
this can be found?

If not, is this functionality planned?


There is a way to do this in 1.1 (the latest stable release) by defining
license terms.  From the top menu, it is under Organize > License >
Manage Terms.  However, this is being deprecated in the next release.


In the next release (1.2), it will be easy to add your own license, and
you could simply add "This software is derived from the GNU GPL XviD
codec" as a license.


Another feature that I've been wanting to get to is an ad hoc string
search, but I'm not sure how useful it really is, since the results
won't be stored in the db.  So do you want ad hoc string searches, or
an easy way to add a phrase, like the one you quoted, as a new license
to look for?


Bob Gobeille
b...@fossology.org


Re: [FOSSology] technical questions FOSSology

2009-08-18 Thread Armijn Hemel
hello Bob, others,

  No, it would not be a new license, it would be random strings, like
  print or fprintf statements from programs. Tools like Blackduck do
  this. I'm OK with it if it is on the fly, but it would be a lot easier
  to have it in FOSSology than having to rely on things like Google code
  search, which require a web interface (at least, I think they do) and
  which require having to send requests to Google.

 Hi Armijn,
 It's been on our to-do list http://fossology.org/task_list for some
 time at a low priority because we have had no user requests (until
 yours) for it.  It's on the list because I thought it would be
 useful.  ;-)  If I get some time I'd like to play around with what it
 would take to implement this.  Are you only interested in English
 strings?

Well, I would be interested in random UTF-8 strings that are likely to
appear in source code :-)

These could be:

* function names
* printf/fprintf statements
* other strings that are in source code and that have been copied into a
binary

I don't see why that should be restricted to just one language; such a
restriction would be a lot harder to implement than plain UTF-8
searches. So, basically it comes down to full-text searches. I'm not
sure how well PostgreSQL can handle these (I think it can).

armijn

-- 
Armijn Hemel, MSc

Loco (Loohuis Consulting)
Training, consultancy, hosting, and building
Specialized in Open Source solutions

http://www.loohuis-consulting.nl/



Re: [FOSSology] technical questions FOSSology

2009-08-18 Thread Bob Gobeille


On Aug 18, 2009, at 12:58 PM, Armijn Hemel wrote:


  Are you only interested in English strings?

Well, I would be interested in random UTF-8 strings that are likely to
appear in source code :-)

These could be:

* function names
* printf/fprintf statements
* other strings that are in source code and that have been copied into a
binary

I don't see why that should be restricted to just one language. It would
be a lot harder to implement than plain UTF-8 searches. So, basically it
comes down to full text searches. I'm not sure how well PostgreSQL can
handle these (I think it can).


The dictionary, stemmer and stopwords are language specific.  Postgres  
can handle all of this.


Just a plain full-text search will find function names, text in
printfs, and strings in binaries, because everything looks like a
string (in my idea of a simple initial implementation).  In other
words, the search won't distinguish between a function name and a
string found elsewhere.  They would all be treated as strings without
any other meaning.
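
As a rough illustration of "everything looks like a string", here is a minimal Python sketch that pulls printable runs out of arbitrary bytes, much as the Unix `strings` tool does. It is not FOSSology code: the function name `extract_strings` and the minimum length of 4 are assumptions made for this example, and it only handles ASCII, not general UTF-8.

```python
import re
from typing import List


def extract_strings(data: bytes, min_len: int = 4) -> List[str]:
    """Return runs of at least min_len printable ASCII bytes found in data.

    Hypothetical helper for illustration only. Function names, printf text
    and other embedded strings all come back as plain strings, with no
    record of what kind of thing each one was.
    """
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [match.decode("ascii") for match in pattern.findall(data)]


if __name__ == "__main__":
    blob = b"\x00\x01xvid_encore\x00\xffThis software is derived\x00ab\x00"
    for s in extract_strings(blob):
        print(s)
```

Runs shorter than `min_len` (such as `ab` above) are dropped, which keeps most binary noise out of the result.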


My thought for the first implementation was something simple:
  1) Scan each file loaded into the repository and create a full-text
     index based on English strings.
  2) The user can then do an ad hoc search for a string.  So searching
     on Xvid would find files containing functions named xvid, printf
     strings with XviD, XVID in language files, ...  The search would
     return a list of files and some bytes around the string (something
     like grep).

This simple implementation is much like a fast grep with stemming,
stopwords, and thesaurus.
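
The two steps above can be sketched in a few lines of Python. This is only a toy, in-memory stand-in under stated assumptions: the names `build_index` and `search`, the tiny stopword set, and the ASCII-only tokenizer are all made up for this example, stemming and a thesaurus are left out, and a real implementation would presumably use the database's full-text index instead.

```python
import re
from collections import defaultdict

# Assumed, deliberately tiny English stopword list (illustration only).
STOPWORDS = {"the", "a", "an", "is", "of", "from", "and", "to", "in"}
TOKEN = re.compile(r"[A-Za-z0-9]+")


def tokenize(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    tokens = (t.lower() for t in TOKEN.findall(text))
    return [t for t in tokens if t not in STOPWORDS]


def build_index(files):
    """Step 1: map each token to the set of file names containing it."""
    index = defaultdict(set)
    for name, text in files.items():
        for token in tokenize(text):
            index[token].add(name)
    return index


def search(index, files, query, context=20):
    """Step 2: return (file name, surrounding text) for files matching all terms."""
    terms = tokenize(query)
    if not terms:
        return []
    names = set.intersection(*(index.get(t, set()) for t in terms))
    hits = []
    for name in sorted(names):
        text = files[name]
        pos = text.lower().find(terms[0])
        start = max(0, pos - context)
        hits.append((name, text[start:pos + len(terms[0]) + context]))
    return hits


if __name__ == "__main__":
    files = {
        "xvid.c": 'int xvid_encore(void) { printf("XviD core"); }',
        "lang.txt": "codec=XVID",
        "other.c": "int main(void) { return 0; }",
    }
    index = build_index(files)
    for name, snippet in search(index, files, "Xvid"):
        print(name, "|", snippet)
```

Because tokens are lowercased and split on non-alphanumeric characters, a query for Xvid matches `xvid_encore`, `XviD` and `XVID` alike, which is the behaviour described in step 2.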


Of course, lots of things could make this better.  For example,
dumping symbol tables and saving them as symbols in the string index
(this would be easy for an agent to do).  Or parsing code (programming
language specific) to extract function names, etc.  That way you could
limit your search to function names, class names, symbol names, etc.
The next step would probably be to load in files of strings as
symbols, text, function names, etc. and use that as a reference file,
like we do for licenses.  The search would find files based on all of
those criteria and return a ranked list of hits.
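
To make the idea of typed entries and ranked hits concrete, here is a small Python sketch. Everything in it is hypothetical: the kind labels ("symbol", "function", "text"), the `add_entry` and `ranked_search` names, and the ranking rule (number of matching entries per file) are assumptions chosen for illustration, not a description of how FOSSology does or would do it.

```python
from collections import Counter


def add_entry(index, filename, kind, value):
    """Record one string for a file, tagged with an assumed kind label."""
    index.append((filename, kind, value.lower()))


def ranked_search(index, term, kinds=None):
    """Return (file name, match count) pairs, best match first.

    If kinds is given, only entries whose kind is in that set are searched,
    e.g. kinds={"function"} limits the search to function names.
    """
    term = term.lower()
    scores = Counter()
    for filename, kind, value in index:
        if kinds is not None and kind not in kinds:
            continue
        if term in value:
            scores[filename] += 1
    # Sort by descending score, then by file name for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    index = []
    add_entry(index, "libxvidcore.so", "symbol", "xvid_encore")
    add_entry(index, "libxvidcore.so", "symbol", "xvid_decore")
    add_entry(index, "libxvidcore.so", "text", "XviD core library")
    add_entry(index, "player.c", "function", "open_xvid_stream")
    add_entry(index, "player.c", "text", "usage: player FILE")
    print(ranked_search(index, "xvid"))
    print(ranked_search(index, "xvid", kinds={"function"}))
```

Keeping the kind next to each string is what allows a search to be limited to function names or symbols, while an unrestricted search still behaves like the simple implementation.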


Bob Gobeille
b...@fossology.org