[
https://issues.apache.org/jira/browse/SOLR-10317?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16011667#comment-16011667
]
Vivek Narang edited comment on SOLR-10317 at 5/16/17 3:08 AM:
--------------------------------------------------------------
Hello [~ichattopadhyaya], I also found another dataset that can be used for
some scenarios/features. Please see
[https://www.kaggle.com/snap/amazon-fine-food-reviews]. This dataset is large,
with over half a million records and ten fields, and it has a good mix of text
and numeric fields. The indexing time currently observed is 1222 seconds on a
standalone node. Please access the file (~250MB) here:
[http://162.243.101.83/Reviews.csv]. I think this is awesome! Please let me
know what you think. Thanks.
--- Data Structure Details ---
Id
ProductId - unique identifier for the product
UserId - unique identifier for the user
ProfileName
HelpfulnessNumerator - number of users who found the review helpful
HelpfulnessDenominator - number of users who indicated whether they found the
review helpful
Score - rating between 1 and 5
Time - timestamp for the review
Summary - brief summary of the review
Text - text of the review
---
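As a rough illustration of the field list above, here is a minimal Python sketch that reads rows with these ten column names into typed records. The column names come from the list; the types (integer Id, counts, and Score, and Time as an integer timestamp), the `Review` class, and the `read_reviews` helper are assumptions made for the example, not something confirmed against the actual file.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator, TextIO


@dataclass
class Review:
    """One row of the reviews file; field types are assumed, not verified."""
    id: int
    product_id: str
    user_id: str
    profile_name: str
    helpfulness_numerator: int
    helpfulness_denominator: int
    score: int
    time: int  # assumed to be an integer timestamp
    summary: str
    text: str


def read_reviews(handle: TextIO) -> Iterator[Review]:
    """Yield a Review for each CSV row, using the header names listed above."""
    for row in csv.DictReader(handle):
        yield Review(
            id=int(row["Id"]),
            product_id=row["ProductId"],
            user_id=row["UserId"],
            profile_name=row["ProfileName"],
            helpfulness_numerator=int(row["HelpfulnessNumerator"]),
            helpfulness_denominator=int(row["HelpfulnessDenominator"]),
            score=int(row["Score"]),
            time=int(row["Time"]),
            summary=row["Summary"],
            text=row["Text"],
        )


# Made-up sample row, used only to show the call shape.
SAMPLE = (
    "Id,ProductId,UserId,ProfileName,HelpfulnessNumerator,"
    "HelpfulnessDenominator,Score,Time,Summary,Text\n"
    '1,P0001,U0001,Example User,1,2,5,1300000000,Good,"Tasty, would buy again"\n'
)

for review in read_reviews(io.StringIO(SAMPLE)):
    print(review.product_id, review.score, review.summary)
```

For the real file, the same function could be called on a handle from `open("Reviews.csv", newline="", encoding="utf-8")`, assuming the file is UTF-8 encoded.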
> Solr Nightly Benchmarks
> -----------------------
>
> Key: SOLR-10317
> URL: https://issues.apache.org/jira/browse/SOLR-10317
> Project: Solr
> Issue Type: Task
> Reporter: Ishan Chattopadhyaya
> Labels: gsoc2017, mentor
> Attachments: changes-lucene-20160907.json,
> changes-solr-20160907.json, managed-schema,
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks.docx,
> Narang-Vivek-SOLR-10317-Solr-Nightly-Benchmarks-FINAL-PROPOSAL.pdf,
> solrconfig.xml
>
>
> Solr needs nightly benchmarks reporting. Similar Lucene benchmarks can be
> found here, https://home.apache.org/~mikemccand/lucenebench/.
> Preferably, we need:
> # A suite of benchmarks that build Solr from a commit point, start Solr
> nodes, both in SolrCloud and standalone mode, and record timing information
> of various operations like indexing, querying, faceting, grouping,
> replication etc.
> # It should be possible to run them either as an independent suite or as a
> Jenkins job, and we should be able to report timings as graphs (Jenkins has
> some charting plugins).
> # The code should eventually be integrated in the Solr codebase, so that it
> never goes out of date.
> There is some prior work / discussion:
> # https://github.com/shalinmangar/solr-perf-tools (Shalin)
> # https://github.com/chatman/solr-upgrade-tests/blob/master/BENCHMARKS.md
> (Ishan/Vivek)
> # SOLR-2646 & SOLR-9863 (Mark Miller)
> # https://home.apache.org/~mikemccand/lucenebench/ (Mike McCandless)
> # https://github.com/lucidworks/solr-scale-tk (Tim Potter)
> There is support for building, starting, indexing/querying and stopping Solr
> in some of the frameworks above. However, the benchmarks they run are very
> limited. Any of these can be a starting point, or a new framework can be used
> as well. The motivation is to cover every piece of Solr functionality
> with a corresponding benchmark that is run every night.
> Proposing this as a GSoC 2017 project. I'm willing to mentor, and I'm sure
> [~shalinmangar] and [[email protected]] would help here.
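The first requirement in the description, recording timing information for operations such as indexing and querying, could be sketched roughly as below. This is only an illustration of the timing idea: `time_operation`, `run_suite`, and `placeholder_indexing` are hypothetical names, and the placeholder does not talk to Solr at all. A real suite would replace it with code that builds, starts, and drives Solr nodes.

```python
import json
import time
from typing import Callable, Dict, List


def time_operation(name: str, operation: Callable[[], object]) -> Dict[str, object]:
    """Run one operation and return its name with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    operation()
    elapsed = time.perf_counter() - start
    return {"operation": name, "seconds": elapsed}


def run_suite(operations: Dict[str, Callable[[], object]]) -> List[Dict[str, object]]:
    """Time each named operation in order and collect the results."""
    return [time_operation(name, op) for name, op in operations.items()]


def placeholder_indexing() -> None:
    """Stand-in workload; a real benchmark would index documents into Solr here."""
    sum(range(100_000))


results = run_suite({"indexing": placeholder_indexing})
# JSON output is one possible format for feeding a charting tool or Jenkins job.
print(json.dumps(results, indent=2))
```

Emitting one record per operation keeps the results easy to append to a nightly history and to plot over time, which is what the second requirement asks for.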