[
https://issues.apache.org/jira/browse/MAHOUT-522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917582#action_12917582
]
Drew Farris commented on MAHOUT-522:
------------------------------------
Yes Oleksandr, please share some patches. It would be great to see your ideas
implemented, even as rough sketches in code.
As far as database input is concerned, it's difficult to predict just what
format or schema any one potential Mahout user's data is stored in. To that end
it makes a great deal of sense to move towards API-oriented access: Mahout
provides APIs for reading and writing various bits of data, while users
implement the necessary glue to hook them up to their data sources.
You don't mention which portion of Mahout you're most interested in; where do
you propose to get started?
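For concreteness, here's a minimal sketch of what such API-oriented access could look like. All names here are hypothetical illustrations, not existing Mahout classes; a real version would write Hadoop SequenceFiles rather than strings.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only -- none of these names exist in Mahout.
// A DocumentSource abstracts where records come from, so the same
// converter can consume a JDBC-backed source or a directory-backed one.
interface DocumentSource {
    Iterator<Map.Entry<String, String>> documents(); // key/value records
}

// Trivial in-memory source standing in for, e.g., a JDBC ResultSet wrapper.
class InMemoryDocumentSource implements DocumentSource {
    private final Map<String, String> docs;

    InMemoryDocumentSource(Map<String, String> docs) {
        this.docs = docs;
    }

    public Iterator<Map.Entry<String, String>> documents() {
        return docs.entrySet().iterator();
    }
}

// The converter depends only on the interface, never on the file system.
class SequenceFileConverter {
    static List<String> convert(DocumentSource source) {
        List<String> records = new ArrayList<>();
        Iterator<Map.Entry<String, String>> it = source.documents();
        while (it.hasNext()) {
            Map.Entry<String, String> e = it.next();
            // A real implementation would call
            // SequenceFile.Writer.append(key, value) here instead.
            records.add(e.getKey() + "\t" + e.getValue());
        }
        return records;
    }
}
```

A database-backed implementation would simply wrap its result set in the same interface, and the converter (and anything downstream) would never know the difference.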
> Using different data sources for input/output
> ---------------------------------------------
>
> Key: MAHOUT-522
> URL: https://issues.apache.org/jira/browse/MAHOUT-522
> Project: Mahout
> Issue Type: Improvement
> Components: Utils
> Reporter: Oleksandr Petrov
>
> Hi,
> Mahout is currently bound to the file system, at least in my experience. Most
> of the time the data structures I'm working with aren't located on the file
> system, and the output isn't bound to the FS either; most of the time I'm
> forced to export my datasets from the DB to the FS, and then load them back
> into the DB afterwards.
> Most likely the core developers, who are working on the algorithm
> implementations, aren't interested in writing adapters to DBs or anything
> like that.
> For instance, SequenceFilesFromDirectory is a simple way to get your files
> from a directory and convert them all to Sequence Files. Some people would be
> extremely grateful if there were an interface they could implement to send
> their files from a DB straight to a Sequence File, without the file system as
> an intermediary. If anyone's interested, I can provide a patch.
> The second issue is related to the workflow process itself. For instance,
> what if I already have a Dictionary, TF-IDF, and TF in some particular format
> that was created by other parts of my infrastructure? Again, I need to
> convert those to the Mahout data structures. Couldn't other jobs accept more
> generic types (or interfaces, for instance) when working with TF-IDF, TF and
> Dictionaries, without binding those to the Hadoop FS?
> I do realize that Mahout is part of the Lucene/Hadoop infrastructure, but
> it's also an independent project, so it may benefit and get wider adoption if
> it can work with any format. I have an idea of how to implement this, and
> have partially implemented it for our infrastructure's needs, but I really
> want to hear some feedback from users and Hadoop developers as to whether
> it's suitable and whether anyone might benefit from it.
> Thank you!
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.