[GitHub] piiswrong commented on a change in pull request #9514: Language Modeling Datasets and Sampler

2018-01-22 Thread GitBox
piiswrong commented on a change in pull request #9514: Language Modeling 
Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#discussion_r163119435
 
 

 ##
 File path: python/mxnet/gluon/data/sampler.py
 ##
 @@ -136,3 +136,30 @@ def __len__(self):
         raise ValueError(
             "last_batch must be one of 'keep', 'discard', or 'rollover', " \
             "but got %s"%self._last_batch)
+
+
+class IntervalSampler(Sampler):
+    """Samples elements from [0, length) at fixed intervals.
+
+    Parameters
+    ----------
+    length : int
+        Length of the sequence.
+
+    Examples
+    --------
+    >>> sampler = gluon.data.IntervalSampler(13, interval=3)
+    >>> list(sampler)
+    [0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
 
 Review comment:
   This doesn't seem very generic anyway. I would put it in the examples.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] piiswrong commented on a change in pull request #9514: Language Modeling Datasets and Sampler

2018-01-22 Thread GitBox
piiswrong commented on a change in pull request #9514: Language Modeling 
Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#discussion_r163113390
 
 

 ##
 File path: python/mxnet/gluon/data/sampler.py
 ##
 @@ -136,3 +136,30 @@ def __len__(self):
         raise ValueError(
             "last_batch must be one of 'keep', 'discard', or 'rollover', " \
             "but got %s"%self._last_batch)
+
+
+class IntervalSampler(Sampler):
+    """Samples elements from [0, length) at fixed intervals.
+
+    Parameters
+    ----------
+    length : int
+        Length of the sequence.
+
+    Examples
+    --------
+    >>> sampler = gluon.data.IntervalSampler(13, interval=3)
+    >>> list(sampler)
+    [0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
 
 Review comment:
   The name `IntervalSampler` suggests it should behave like a slice, `[begin:end:step]`.
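To make the naming concern concrete, here is a small sketch that contrasts slice semantics with the order shown in the docstring example. It uses only plain Python; the variable names are illustrative and are not taken from the pull request.

```python
length, interval = 13, 3

# Slice-like semantics: a single pass over [begin:end:step].
slice_like = list(range(length))[0:length:interval]

# Order from the docstring example: one pass per starting offset
# 0, 1, ..., interval - 1, so every index is eventually visited.
documented = [j for i in range(interval) for j in range(i, length, interval)]

print(slice_like)  # [0, 3, 6, 9, 12]
print(documented)  # [0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
```

The two agree only on the first pass, so a reader who expects `[begin:end:step]` would see eight more indices than anticipated.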




[GitHub] piiswrong commented on a change in pull request #9514: Language Modeling Datasets and Sampler

2018-01-22 Thread GitBox
piiswrong commented on a change in pull request #9514: Language Modeling 
Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#discussion_r163103170
 
 

 ##
 File path: python/mxnet/gluon/data/text.py
 ##
 @@ -0,0 +1,160 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# coding: utf-8
+# pylint: disable=
+"""Text datasets."""
+__all__ = ['WikiText2', 'WikiText103']
+
+import io
+import os
+import zipfile
+import shutil
+import numpy as np
+
+from . import dataset
+from ..utils import download, check_sha1
+from ...contrib import text
+from ... import nd
+
+
+class WikiText2(dataset._DownloadedDataset):
+    """WikiText-2 word-level dataset for language modeling, from Salesforce research.
+
+    From
+    https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset
+
+    License: Creative Commons Attribution-ShareAlike
+
+    Each sample is a vector of length equal to the specified sequence length.
+    At the end of each sentence, an end-of-sentence token '<eos>' is added.
+
+    Parameters
+    ----------
+    root : str, default '~/.mxnet/datasets/cifar10'
+        Path to temp folder for storing data.
+    segment : str, default 'train'
+        Dataset segment. Options are 'train', 'validation', 'test'.
+    indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer`, default None
 
 Review comment:
   1. I don't think so. Do you have a reference for it being used somewhere?
   2. It is, but it is a contrib API. If you want to use it directly, then gluon.data.text needs to be in gluon.contrib too.

   We do need to expose something like this, but it can't be TokenIndexer.




[GitHub] piiswrong commented on a change in pull request #9514: Language Modeling Datasets and Sampler

2018-01-22 Thread GitBox
piiswrong commented on a change in pull request #9514: Language Modeling 
Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#discussion_r163100367
 
 

 ##
 File path: python/mxnet/gluon/data/sampler.py
 ##
 @@ -136,3 +136,30 @@ def __len__(self):
         raise ValueError(
             "last_batch must be one of 'keep', 'discard', or 'rollover', " \
             "but got %s"%self._last_batch)
+
+
+class IntervalSampler(Sampler):
+    """Samples elements from [0, length) at fixed intervals.
+
+    Parameters
+    ----------
+    length : int
+        Length of the sequence.
+
+    Examples
+    --------
+    >>> sampler = gluon.data.IntervalSampler(13, interval=3)
+    >>> list(sampler)
+    [0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
 
 Review comment:
   Why should it roll over at the end?
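One way to address this question would be to make the rollover optional. The generator below is only a sketch under that assumption: `interval_indices` and its `rollover` flag are hypothetical names for illustration and do not appear in the diff under review.

```python
def interval_indices(length, interval, rollover=True):
    """Yield indices from [0, length) at a fixed interval.

    `rollover` is a hypothetical flag for this sketch: when True, the
    iteration restarts at offsets 1, 2, ..., interval - 1 after the
    first pass; when False, it stops after the first pass.
    """
    if not 0 < interval <= length:
        raise ValueError("interval must be in (0, length]")
    starts = range(interval) if rollover else range(1)
    for start in starts:
        for index in range(start, length, interval):
            yield index


print(list(interval_indices(13, 3)))
# [0, 3, 6, 9, 12, 1, 4, 7, 10, 2, 5, 8, 11]
print(list(interval_indices(13, 3, rollover=False)))
# [0, 3, 6, 9, 12]
```

With rollover enabled, every index in `[0, length)` is produced exactly once, which is one plausible reason for the behavior in the docstring; without it, the result matches a plain strided slice.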




[GitHub] piiswrong commented on a change in pull request #9514: Language Modeling Datasets and Sampler

2018-01-22 Thread GitBox
piiswrong commented on a change in pull request #9514: Language Modeling 
Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#discussion_r163099680
 
 

 ##
 File path: python/mxnet/gluon/data/text.py
 ##
 @@ -0,0 +1,160 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# coding: utf-8
+# pylint: disable=
+"""Text datasets."""
+__all__ = ['WikiText2', 'WikiText103']
+
+import io
+import os
+import zipfile
+import shutil
+import numpy as np
+
+from . import dataset
+from ..utils import download, check_sha1
+from ...contrib import text
+from ... import nd
+
+
+class WikiText2(dataset._DownloadedDataset):
+    """WikiText-2 word-level dataset for language modeling, from Salesforce research.
+
+    From
+    https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset
+
+    License: Creative Commons Attribution-ShareAlike
+
+    Each sample is a vector of length equal to the specified sequence length.
+    At the end of each sentence, an end-of-sentence token '<eos>' is added.
+
+    Parameters
+    ----------
+    root : str, default '~/.mxnet/datasets/cifar10'
+        Path to temp folder for storing data.
+    segment : str, default 'train'
+        Dataset segment. Options are 'train', 'validation', 'test'.
+    indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer`, default None
+        The indexer to use for indexing the text dataset. If None, a default indexer is created.
+    seq_len : int, default 35
+        The sequence length of each sample, regardless of the sentence boundary.
+    transform : function, default None
 
 Review comment:
   Dataset now has a transform API. Use that instead of adding a transform callback to every dataset.
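The design point can be illustrated with a minimal stand-in. `ListDataset` below is a hypothetical class written for this sketch; it only mimics the idea of a dataset-level `transform` method and is not the Gluon implementation, whose exact signature should be checked against the Gluon documentation.

```python
class ListDataset:
    """Minimal stand-in for a dataset (hypothetical, not the Gluon class)."""

    def __init__(self, items):
        self._items = list(items)

    def __len__(self):
        return len(self._items)

    def __getitem__(self, idx):
        return self._items[idx]

    def transform(self, fn):
        """Return a new dataset with fn applied to every sample."""
        return ListDataset(fn(item) for item in self._items)


# The dataset constructor takes no transform argument; callers chain
# transform() afterwards instead.
samples = ListDataset([[1, 2, 3], [4, 5, 6]])
shifted = samples.transform(lambda seq: [token + 1 for token in seq])
print(shifted[0])  # [2, 3, 4]
```

Keeping the transform on the base class means individual datasets such as `WikiText2` would not each need their own `transform` constructor parameter.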




[GitHub] piiswrong commented on a change in pull request #9514: Language Modeling Datasets and Sampler

2018-01-22 Thread GitBox
piiswrong commented on a change in pull request #9514: Language Modeling 
Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#discussion_r163099562
 
 

 ##
 File path: python/mxnet/gluon/data/text.py
 ##
 @@ -0,0 +1,160 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# coding: utf-8
+# pylint: disable=
+"""Text datasets."""
+__all__ = ['WikiText2', 'WikiText103']
+
+import io
+import os
+import zipfile
+import shutil
+import numpy as np
+
+from . import dataset
+from ..utils import download, check_sha1
+from ...contrib import text
+from ... import nd
+
+
+class WikiText2(dataset._DownloadedDataset):
+    """WikiText-2 word-level dataset for language modeling, from Salesforce research.
+
+    From
+    https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset
+
+    License: Creative Commons Attribution-ShareAlike
+
+    Each sample is a vector of length equal to the specified sequence length.
+    At the end of each sentence, an end-of-sentence token '<eos>' is added.
 
 Review comment:
   If seq_len doesn't respect sentence boundaries, why should it end with eos?
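The interaction between the end-of-sentence token and fixed-length samples can be sketched as follows. This is an illustration of the behavior the docstring describes, not the code in the pull request: `to_samples` is a hypothetical helper, the token is assumed to be written `<eos>`, and dropping the trailing remainder is an assumption of this sketch.

```python
EOS = '<eos>'  # assumed spelling of the end-of-sentence token


def to_samples(sentences, seq_len):
    """Flatten sentences into one token stream and cut it into
    fixed-length samples (hypothetical helper for illustration)."""
    tokens = []
    for sentence in sentences:
        tokens.extend(sentence.split())
        tokens.append(EOS)  # mark the end of every sentence
    # Assumption: tokens that do not fill a whole sample are dropped.
    num_samples = len(tokens) // seq_len
    return [tokens[i * seq_len:(i + 1) * seq_len] for i in range(num_samples)]


samples = to_samples(["the cat sat", "it purred"], seq_len=3)
print(samples)
# [['the', 'cat', 'sat'], ['<eos>', 'it', 'purred']]
```

Here the token marks sentence ends inside the stream, while sample boundaries fall wherever `seq_len` puts them, so a sample can start with `<eos>` and need not end with it.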




[GitHub] piiswrong commented on a change in pull request #9514: Language Modeling Datasets and Sampler

2018-01-22 Thread GitBox
piiswrong commented on a change in pull request #9514: Language Modeling 
Datasets and Sampler
URL: https://github.com/apache/incubator-mxnet/pull/9514#discussion_r163099217
 
 

 ##
 File path: python/mxnet/gluon/data/text.py
 ##
 @@ -0,0 +1,160 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+# coding: utf-8
+# pylint: disable=
+"""Text datasets."""
+__all__ = ['WikiText2', 'WikiText103']
+
+import io
+import os
+import zipfile
+import shutil
+import numpy as np
+
+from . import dataset
+from ..utils import download, check_sha1
+from ...contrib import text
+from ... import nd
+
+
+class WikiText2(dataset._DownloadedDataset):
+    """WikiText-2 word-level dataset for language modeling, from Salesforce research.
+
+    From
+    https://einstein.ai/research/the-wikitext-long-term-dependency-language-modeling-dataset
+
+    License: Creative Commons Attribution-ShareAlike
+
+    Each sample is a vector of length equal to the specified sequence length.
+    At the end of each sentence, an end-of-sentence token '<eos>' is added.
+
+    Parameters
+    ----------
+    root : str, default '~/.mxnet/datasets/cifar10'
+        Path to temp folder for storing data.
+    segment : str, default 'train'
+        Dataset segment. Options are 'train', 'validation', 'test'.
+    indexer : :class:`~mxnet.contrib.text.indexer.TokenIndexer`, default None
 
 Review comment:
   I wouldn't expose this to users.
   1. Indexer is not the standard term for this.
   2. This is a contrib API and subject to change. Gluon Dataset should use a separate vocabulary API.
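As a rough idea of what a separate vocabulary API might look like, here is a minimal sketch. The `Vocabulary` class, its attributes, and the `<unk>` token are all hypothetical and chosen for illustration; none of them come from MXNet or from this pull request.

```python
from collections import Counter


class Vocabulary:
    """Hypothetical token-to-index mapping (not an MXNet class)."""

    def __init__(self, tokens, unknown_token='<unk>'):
        counts = Counter(tokens)
        # Index 0 is reserved for unknown tokens. The remaining tokens are
        # ordered by descending frequency, with ties broken alphabetically
        # so the mapping is deterministic.
        ordered = sorted(counts, key=lambda t: (-counts[t], t))
        self.idx_to_token = [unknown_token] + [
            t for t in ordered if t != unknown_token
        ]
        self.token_to_idx = {t: i for i, t in enumerate(self.idx_to_token)}

    def __len__(self):
        return len(self.idx_to_token)

    def to_indices(self, tokens):
        """Map tokens to indices, using 0 for out-of-vocabulary tokens."""
        return [self.token_to_idx.get(t, 0) for t in tokens]


vocab = Vocabulary("the cat sat on the mat".split())
print(vocab.to_indices(["the", "cat", "dog"]))  # [1, 2, 0]
```

A dataset could then accept such an object without depending on the contrib `TokenIndexer` directly.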

